6.1. Finding data#

Congratulations! You’ve decided to do a data science project! Data science projects generally begin in one of two ways: either you have some data and are trying to figure out what to do with it, or you have a research question you’d like to explore and need to find some data to help answer it. In the latter case, we recommend taking a moment to first think about:

  1. What sort of question would you like to answer with data? Can you state it as specifically as possible? (You can expand it later, but generally it’s helpful to start with something more manageable as you begin a project.)

  2. What would your ideal dataset look like to help you answer this question? That is, what would the ideal units of analysis be, and what variables would you want to have?

Having some clarity about these two questions before you take to the internet searching for data will help you find what you’re looking for – or realize that what you’re looking for doesn’t (yet) exist – more quickly. And, being able to visualize how things we care about in the world – say, education, climate change, athlete performance, or whatever is important to you – might look in a dataset is one of the most fundamental skills of being a data scientist. A dataset is just one specific snapshot of one part of the world at one specific moment in time – and being able to translate the complicated, messy world into even an imagined dataset is the first step in thinking about how data is constructed, what it represents, and whether we’re really representing the elements of the world that we think we are.

On the other hand, if you’re beginning your data science research from the perspective of having a dataset that either your professors will ask you to do something with (as we’ll do in this book), or about which you’ll need to figure out what sorts of analyses to conduct, you’ll still need to evaluate that dataset. While the rest of this section of the chapter focuses on data in the world, everything that follows will be useful for evaluating all data – whether you found it on your own, had it handed to you by someone else, or even collected and built it yourself.

Data out there in the world#

Suppose you take to the internet to look for some data (which we hope you’ll do as you continue to gain skills in analyzing data – the goal very much is to give you data science techniques you can use to explore questions that matter to you in a thoughtful, ethical, creative, and rigorous way after this course is over).

There are four general types of data you’ll find.

1. Data in great shape#

This is, perhaps surprisingly, relatively rare. Data in great shape is organized in a way that’s readily usable – for our purposes this will generally mean that it’s organized into observations and variables (that is, structured data), and that it’s either in the units of analysis you need or can be transformed into them. For example, suppose you wish to study the annual GDP of OECD countries, but you find data on the monthly GDP of each. It would require very little code to transform this into annual data (though going the other direction would be more problematic).
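To make that last point concrete, here is a minimal sketch of the monthly-to-annual transformation using pandas. The country and GDP figures are invented placeholders, not real OECD data:

```python
import pandas as pd

# Hypothetical monthly GDP figures for one country (values are made up)
monthly = pd.DataFrame({
    "country": ["Canada"] * 12,
    "month": pd.date_range("2023-01-01", periods=12, freq="MS"),
    "gdp": [100 + i for i in range(12)],
})

# Aggregate the monthly observations up to annual totals per country:
# extract the year from each date, then group and sum
annual = (
    monthly.assign(year=monthly["month"].dt.year)
           .groupby(["country", "year"], as_index=False)["gdp"]
           .sum()
)
print(annual)
```

Going the other way – from annual totals back to months – would require assumptions about how GDP was distributed within each year, which is exactly why it’s more problematic.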

Data in great shape also generally means that the contents of the dataset are reliable and clean. We’ll be more specific about this later in the chapter, but to build some intuition: if, for example, your data includes the names of athletes or senators, the names should be written in a consistent format (e.g., first name then last name; middle initials either systematically included or not, and so on). For numeric data, you may want to check that values are on a similar scale within each variable (as opposed to some observations being recorded in inches while others are in feet). Of course, we also hope there are minimal missing values – and that any values that are missing are missing at random.
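Checks like these are easy to automate. The sketch below uses a tiny made-up roster to count missing values per column and flag names that don’t follow an assumed “Last, First” convention; the data and the expected pattern are both illustrative assumptions:

```python
import pandas as pd

# Toy roster with an inconsistent name format and missing values (made-up data)
df = pd.DataFrame({
    "name": ["Smith, Jane", "John Doe", "Lee, Ann", None],
    "height_in": [68.0, 70.5, None, 64.0],
})

# How many values are missing in each column?
missing = df.isna().sum()

# Flag names that don't match the assumed "Last, First" pattern
names = df["name"].dropna()
bad_format = names[~names.str.contains(r"^\w+, \w+$")]

print(missing)
print(bad_format)
```

Running checks like this early tells you how much cleaning lies ahead before any analysis can begin.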

The qualities of “great shape” discussed so far are nice-to-haves rather than (necessarily) dealbreakers: their absence generally just means more data cleaning (or “pre-processing” or “wrangling”) for you. But one factor that is non-negotiable for data to be usable is that it comes with a codebook, or data dictionary, which we’ll discuss in a few moments.

2. Data that needs some (or maybe a lot of) work#

If you do find data that is related to your area of interest and seems to hold promise, that’s always a great first step, but often you may find that the data needs a serious amount of work. One common way this shows up is that you find the data you need, but it’s spread across different datasets, often from different sources. This means you’ll need to combine the datasets yourself in order to conduct your analysis, and depending on how the data is structured and how compatible the sources are, this could take anywhere from a few minutes (unlikely) to a serious amount of time (from experience we can at least tell you it will almost certainly take longer than you think).
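When the sources share a common key, combining them can be as simple as a merge. Here is a sketch with two invented tables keyed on country name; the `indicator` option is handy for diagnosing where the sources disagree:

```python
import pandas as pd

# Two hypothetical sources keyed on country name (all figures invented)
gdp = pd.DataFrame({"country": ["Canada", "France", "Japan"],
                    "gdp_trillions": [2.1, 3.0, 4.2]})
pop = pd.DataFrame({"country": ["Canada", "Japan", "Brazil"],
                    "population_millions": [38, 125, 214]})

# An outer merge keeps every country from both sources; the `_merge`
# column shows which source(s) each row came from
combined = gdp.merge(pop, on="country", how="outer", indicator=True)
print(combined)
```

The hard part in practice is rarely the merge call itself – it’s reconciling keys that almost match (“USA” vs. “United States”), which is where the real time goes.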

Another way data can need a lot of work – and this can be common if you’re working in an area that doesn’t have a ton of quantitative attention – is that it might be out there in the form of information, but not organized into a dataset at all. For example, perhaps you are looking for data about the quarterly earnings of a company, but it only releases them as PDFs each quarter. The information is out there, but you’ll have to put it together, perhaps manually, or perhaps by writing a script that pulls the information (alas, we won’t cover that in this book, but we will give you the skills to go out and find code that could do it, and modify it as you need; as we’ve said (many times) elsewhere, this kind of post-course self-sufficiency is one of our top goals).
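As a toy illustration of what such a script might do, suppose the text of a quarterly report has already been extracted to a plain string (the report lines and figures below are invented, and real PDFs are far messier). A regular expression can then turn the free-form text into structured rows:

```python
import re

# Pretend this text was extracted from a quarterly-report PDF
# (the lines and dollar figures are invented for illustration)
report_text = """
Q1 2024 revenue: $1,200,000
Q2 2024 revenue: $1,350,000
"""

# Pull (quarter, year, amount) tuples out of the free-form text
pattern = r"(Q[1-4]) (\d{4}) revenue: \$([\d,]+)"
rows = [(q, int(year), int(amount.replace(",", "")))
        for q, year, amount in re.findall(pattern, report_text)]
print(rows)
```

This is the general shape of the task: find the regularities in the unstructured source, then encode them as rules that produce observations and variables.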

As this is not an explicitly data wrangling course, we won’t spend a ton of time talking about how to construct datasets from information that’s in principle out there in the world. But we do want you to be aware as you go into data science that you’re likely going to spend a lot of time on this part unless you’re in a situation where data is explicitly being given to you.

3. Garbage data#

This is unfortunately extremely common. There’s no doubt that there is a ton of data out in the world. And while we don’t have an official estimate, in our experience it’s pretty safe to assume that a lot of it is not good and you should not use it.

How do you know if data is not good? Well, that’s the subject of the rest of the chapter, but one big thing to check first is whether the data is documented in the form of a codebook (sometimes called a data dictionary, though we will mostly use codebook here) – data without one is immediately useless.

As we’ll see in the rest of the chapter, in order to evaluate the quality of our data and understand how to adjust our inferences as a result, we need to know where the data comes from. We need to know how it was measured, how it was collected, over what time period it was collected, and ideally who did the collecting. Unfortunately, data without documentation that carefully covers this information should not be used – if you don’t know what you’re analyzing, the risks of faulty inferences become huge.

4. Nothing at all#

Perhaps surprisingly, despite all the buzz about big data and data lakes and data everything, there is a lot out there that has not yet been turned into data. You might be surprised, as you begin your data science journey, by how little data there actually is on many subjects you care about. There may be a variety of reasons for this – perhaps the subject you’re interested in (e.g., “justice” or “trust”) is difficult to measure because of its complexity or abstraction. Or, there may be ethical, practical, or compliance challenges, much like we saw with randomized controlled experiments – as informative as it might be to have data on what medicines people take, or what grades students get in courses in different subjects, there are laws in place that prevent that data from being shared (and we’ll explore this in more depth in the chapter on ethics, too).

Finally, a third reason we may not have data on something at all is what we’ll discuss in more detail later in this chapter as errors of exclusion. This is when we fail to study something because we do not think it is important enough or worth the cost of studying it. Collecting data is not a trivial exercise, and we tend to only do it if we think it’s important enough and/or likely to benefit us in some way. The things we have data about in the first place tend to reflect our values as a society. The creation and existence of data itself can then further reinforce those values (we might ask: well, why wouldn’t we measure GPA? We’ve been measuring it for so long, it must be important!).

All three reasons why we might not find data (and they are not the only ones) – difficulty of measurement, ethical/practical/compliance challenges, and errors of exclusion – are also often related. We may not try to overcome the difficulty of measuring something if we don’t think it’s worth it. Or we might consider asking about money to be unethical because we have societal norms around it, even if there aren’t legal protections in place.

Adventures in data collection#

The next two sections of this chapter will detail some of the top elements to evaluate when you are presented with a dataset, whether it’s given to you to analyze, or you found it somewhere, or you’re reading the research of others. These guidelines are not the only ones, but are all key areas where data can really mislead us if we’re not careful about its contents. The final section will cover some simple Python code to start to clean up data that’s in the first (and maybe a little bit in the second) category above.

And, while this is not expressly a course about data collection itself, these are all also guidelines to think about when you conduct your own data collection work. In the meantime, as always, we encourage you to not just work with the examples we give you, but to also explore the wide world out there for datasets that interest you. Begin with the two questions above, and then apply the guidelines here and in the sections that follow. This isn’t a foolproof way to ensure you have “good” data, but it will help you start to build instincts (like the ones we talked about in chapter 1 for why data science is also an art!) for assessing whether you’re working with data that accurately and helpfully reflects the world(s) you hope to understand – while also, of course, causing minimal harm to the subjects or others who might be affected by your work (to preview the ethics chapter some more!).