6.3. Common errors in data
All data has some kind of error in it. As with our discussion of measurement, imperfection does not mean we throw the data away, nor is it an excuse to ignore data because we don’t agree with or like the trends or patterns we observe. But it’s important to be critical of all data: to consider carefully why it might be incorrect, to what extent, and how these errors might influence our inferences.
Of course, it’s possible that one or more of the below errors are so substantial that we conclude we actually cannot learn anything meaningful about the world. If this is the case, again, it’s not enough to simply “reject” findings because the data is “bad”. It’s our responsibility to be clear about what specifically concerns us, how it impacts our inferences, and what we’d like to see instead.
To make this more concrete: it’s far too easy to dismiss a study about a sensitive topic – say, climate change, or health outcomes of abortion, or the impact of a change in gun legislation on gun deaths – because we have our own motivated reasons to prefer certain results on the basis of “bad” or “biased” data. This is not what is meant by scientific thinking or critical evaluation of data. If you do think that the data are “bad” or “biased”, it is incumbent on you to specifically state what you think is going wrong with this data, and why that’s a problem. For example, you might point to the fact that you’re concerned we are systematically undercounting certain members of a population, which is leading us to overstate some trend.
One of the biggest distinctions between conspiracy thinking and actual science – especially in an age of mistrust of sources, misinformation, and data that seems to support pretty much any view we want to further – is that in science it’s not enough to say we “trust” or “don’t trust” a study or data source. Rather, we must be specific and clear about the strengths and weaknesses in a study, including the researchers’ choice of methods and the quality of the data itself. Then we can start to talk about the conclusions we might draw.
In addition to interrogating how our data was measured, here are four crucial errors to look for in any dataset.
1. Random errors
Random errors are errors in a dataset that, on average, in large datasets, cancel out. Ideally, we also hope our random errors are orthogonal to our variables of interest – meaning, they do not relate in any meaningful way to the variables we care about, or that knowing something about the value of our variable of interest tells us nothing about our error. For example, if a pulsometer occasionally counts someone’s movement as a pulse, and occasionally misses a faint pulse, but on average the errors cancel out and knowing something about someone’s pulse rate (say, if it’s high or low) doesn’t tell us anything about whether we might expect more or fewer errors, then we are in the world of random errors.
The data collection stage is a common place where random errors can be introduced: perhaps we type the wrong number when we’re manually entering data, our pedometer isn’t perfectly precise, we mishear someone’s reply in a survey (or they mishear us), or some words don’t get scanned properly if we’re digitizing paper archives. Generally, random errors are to be expected somewhere, and as long as there aren’t too many and we’re sure they’re not systematic, we tolerate them as part of working with data. Of course, an important and immediate question is: what counts as “too many”? Well, that depends on what you’re studying, your sample size, and your tolerance for uncertainty, among other judgment calls! (Have we mentioned data doesn’t say anything, and rather we say things about data?)
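To make the idea of errors “canceling out” concrete, here is a minimal simulation sketch (our own illustration, using Python and NumPy, with entirely invented numbers): we generate hypothetical “true” pulse values, add zero-mean noise to mimic an imperfect device, and check that the average barely moves and that the noise is unrelated to the true values.

```python
# A sketch of random measurement error (all numbers invented for illustration).
import numpy as np

rng = np.random.default_rng(42)

n = 10_000
true_pulse = rng.normal(loc=70, scale=8, size=n)   # hypothetical "True" resting pulses
noise = rng.normal(loc=0, scale=5, size=n)         # zero-mean device error
measured_pulse = true_pulse + noise

print(true_pulse.mean())                           # ~70
print(measured_pulse.mean())                       # also ~70: the errors roughly cancel out
print(np.corrcoef(true_pulse, noise)[0, 1])        # ~0: error is orthogonal to the value
```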
If, however, it turns out we are more likely to mis-scan certain words, or mishear certain types of replies, or our pedometer is more likely to miss steps you actually take than to credit you for steps you didn’t, then we have entered the world of systematic errors.
2. Systematic errors
Systematic errors are not random, and they do not cancel out over time. They arise when errors in a dataset differ systematically from the underlying reality. For example, the values we record in our dataset might be consistently higher, lower, or more extreme than the real “True” values in the world. While random errors in high volumes can make our results less reliable and precise overall, systematic errors – sometimes even seemingly small or benign ones, depending on our research goals – can bias our results in a particular direction, which could have a meaningful impact on our inferences and any policies we implement as a result of them.
For example, suppose we are conducting a study to predict the winner of an upcoming election. If the voters who support Candidate A are even slightly less likely to answer a phone call from an unknown number (for all kinds of possible reasons) than voters for Candidate B, then we are going to undercount support for Candidate A, and our predictions will overstate the possibility of Candidate B winning. In a close election, even being off by a small amount can make a big difference in the usefulness of our predictions.
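Here is a minimal simulation sketch of that scenario (not real polling data; the 52/48 split and the response rates are invented purely for illustration), showing how even a small gap in who answers the phone can flip the apparent winner.

```python
# A sketch of differential nonresponse (all numbers invented for illustration).
import numpy as np

rng = np.random.default_rng(0)

n_calls = 100_000
supports_a = rng.random(n_calls) < 0.52           # true preference: 52% support Candidate A
answer_prob = np.where(supports_a, 0.08, 0.10)    # A's supporters answer slightly less often
answered = rng.random(n_calls) < answer_prob

print("True support for A:   52.0%")
print(f"Polled support for A: {supports_a[answered].mean():.1%}")  # well below 52%
```

With these made-up numbers, the poll shows Candidate A trailing even though a slight majority actually supports them – and calling more people doesn’t help, because the error is systematic rather than random.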
Systematic errors are very common and are widely studied. There are many(!) different specific ways a dataset might be systematically wrong, and we have specific names for each. You’ll come across many more as you continue in data science, but for now we will specifically focus on one broad category of systematic error: Selection bias.
Selection bias is when the observations that appear in your dataset are in there due to non-random selection. This is tricky, because often when we think we are following a random selection process, we may in reality not be. Consider the voter opinion poll above. If we call 100 people using random digit dialing, or by putting ads in the social media feeds of 100 randomly chosen people, not all of them will answer the phone or click on the ad. Perhaps younger people are less likely to answer the phone. Or only politically engaged people are going to click on an ad for a political poll. If age and/or political engagement at all track with our opinions about for whom we are going to vote (and much research suggests they do), then even though our data collection process was random selection, the actual humans that end up as observations in our dataset are not random. In fact, they systematically represent people who are, e.g., older, more politically engaged, who have an internet or phone connection in the first place, and so on. While as researchers we aim to do our absolute best to collect a random sample, we must have the humility, subject matter expertise, and critical thinking skills to imagine how selection bias may be creeping in. And, we must do the same when presented with research from others.
Within selection bias, we can be even more specific. One type of selection bias is response bias. This is when, even if you have managed to include observations that have been randomly selected (i.e., they have not themselves selected into participation in your study for reasons that might influence your results), there still may be values in your dataset that differ systematically from the underlying reality. In survey research, response bias often (but not always) boils down to respondents not telling the truth. This might be due to social desirability bias. For example, it turns out that if you ask people if they are going to vote in the upcoming election, they are generally more likely to say yes, even if they do not end up voting. This might be because generally in the US voting is seen as a normatively “good” thing to do, so the respondent may feel pressure to say they will. Or, perhaps the respondent suspects it’s important to the person doing the interview. Or, the respondent may actually intend to vote, but be overstating their own capacity to find time on the actual day of the election to go do it. Response bias and its subset social desirability bias do kind of boil down to “lying”, but this does not mean respondents are actively trying to deceive or that it’s done out of malice. It’s built into our societies and cognitive processes.
Here’s another example that’s perhaps less politically charged. Think about when you go to the doctor, and they ask you how often you exercise, or how many servings of fruits and vegetables you eat in a day. On average, we suspect people are more likely to overstate both of those things. This might be because we want our doctor to think more highly of us, or it might be because we personally overestimate how often we do things that we think of as “good” habits. Knowing that there is a likely response bias, and that a likely source is social desirability (or perhaps self-preservation) doesn’t mean we necessarily throw all response-based data away. But it does mean we should keep this in mind when drawing inferences – perhaps adding that we are likely overestimating how much people actually exercise, or specifically that our results are likely biased upwards.
We’ve already seen another type of selection bias: survivorship bias. As you’ll recall, this is when the subjects or units in our study are in there because they happened to survive some process we are interseted in – such as a plane surviving being shot down in a World War, or a person surviving Covid. Of course, it’s extremely difficult, if not impossible, to study planes that did get shot down or people who did not survive Covid. A (slightly) less grim example might be in a drug trial: if a drug trial goes on for a long time, we naturally might anticipate that people will drop out of the study – they get busy, they lose interest, they think the drug isn’t working, and so on. Thus, the results of the study will necessarily reflect those who stayed in it for the full time. If we suspect people who stayed in the drug trial did so for a reason related to the disease we are trying to treat (perhaps they are more critically ill, or care more about their health in other domains), then we might need to adjust our inferences about the effectivness of the drug upwards, downwards, outwards, inwards, or however we best judge. Sometimes we may not even know, but being transparent about that is the best we can do, and then invite others to help us understand and study the direction(s) of the possible bias.
If all of this sounds unsatisfyingly speculative, unfortunately, it is. Not only is it usually quite difficult to be one hundred percent sure of the direction in which a systematic error is biasing our results, but it’s also not necessarily going to be at all obvious whether or not a systematic error is present. In many areas that have been studied extensively, we have principled reasons and evidence from previous studies to make us fairly confident about the presence and likely effect of selection bias. And we can do things to address them. When we conduct interviews, for example, we can word our questions in such a way as to hopefully reduce some of these biases. It turns out, for instance, that asking people, “Are you racist?” almost always gets a “no” reply. But rephrasing it to, “Do you know anyone who’s racist?” will get more “yes” replies. There’s further reason to believe that this second version helps us get a better picture of the True levels of racism in a society, but even this is an ongoing research effort.
Thus, while there are many areas where we can predict or anticipate systematic errors – and perhaps even try to mitigate them head on – there are even more areas where we don’t even see them. In fact, there’s a selection bias here in our examples of selection bias – we’ve only included cases where we suspect it’s present. There may be many (many!) cases where we aren’t even aware of it.
Again, we cannot stress enough that this doesn’t mean we throw away data we don’t like. If we see the result of a political poll that makes us unhappy, we cannot simply say “it’s biased”. At minimum, we must specify what type of bias we think might be afoot, why we suspect it’s biased, and in what direction this might impact our results. Even then, there’s no guarantee of objectivity, but a deep premise of science is that if we keep studying things from many different perspectives and are open to criticisms, imperfections, and improving – eventually, on average, the results we get will broadly trend towards the Truth. I know, this is the most hopeful thing in this chapter so far.
3. Errors of validity
Errors of validity are when we are not actually measuring what we think we are measuring. This is exactly related to the validation step in measurement. While we certainly want to take every precaution to make sure our measures are valid, this does not guarantee they will be. Errors of validity can be difficult to diagnose, because they often come down to disagreeing views about what we’re actually trying to measure (conceptualization) and how we’re measuring it (operationalization).
As we stated in the previous section, there is a lot of research and work on validity out there in the world, and you’ll likely come across it as you continue in data science. For our purposes in this course, we are focusing on the first step: awareness. Whenever someone presents data, or you find a dataset, or you read a study, we all must ask: is this study actually measuring what it claims to be measuring?
A common example of this is SATs and other standardized tests, such as GREs, A-levels, and AP tests. They exist because of a desire to have some kind of comparable way to evaluate student performance or knowledge across different school zones, regions, and even countries. But it remains a (mostly) open debate as to whether, say, SAT scores are really measuring something meaningful about students’ abilities, or whether they are picking up something else – such as access to test preparation resources, or certain types of teachers, or a kind of thinking that is valuable but not necessarily the only one we want to maximize when, say, admitting students, assembling a graduate cohort, or deciding whom to hire. (There are other reasons these inferences are difficult to make about standardized tests in particular, which we will discuss later in the book.)
Because turning the world into data necessarily requires that we simplify many dimensions into relatively few, we always run the risk of making an error of validity. Even something simple like the measures taken at a doctor’s office to evaluate overall health – such as blood pressure, heart rate, temperature, weight, height, and so on – of course doesn’t capture the full picture of someone’s health (and the focus on some measures over others can also reflect societal norms, conventions, and ideals more than actual underlying health). Again, this is not to say it’s all garbage or that we necessarily ignore, say, SAT scores and Body-Mass Indices. Rather, it’s to say that we must always be asking: Are we actually measuring what we want to measure? What are we missing? How can we improve?
4. Errors of exclusion
Speaking of missing things, errors of exclusion are when we are failing to study something because of a lack of interest, perceived importance, or any recorded or collected information on it (and usually all three are related – we haven’t bothered to collect it, because we don’t think it’s important, which means we’re not interested). We mentioned this in the section on looking for data – indeed, one of the first signs you’re dealing with an error of exclusion is when you look for data on something and find that it simply does not exist.
As objective as we all (hopefully) aspire to be, life, energy, attention, resources, and computational power are still finite, so we cannot study everything (alas). This means we must choose what to study, which always means choosing what not to study. While it would be great if we could all study what we are most curious about, the reality of needing to make a living and/or get funding to support the research – never mind other constraints – means we may need to focus on topics that have already been studied, that other people deem important, or that various funding agencies happen to consider a priority. Or, if you’re working at a company, you’re probably going to have serious incentives to pour more energy into studying your customers, say, than your own employees – at least in the short term. At the risk of getting extremely meta (not in the social media way), even the question of what we are curious about is most likely informed by the culture we grew up in, the people we surround ourselves with, and the sorts of things we are told are problems by the people we trust.
Errors of exclusion are the trickiest of the four errors to think about, because they require thinking about what, or whom, we’re not thinking about. To make it slightly more concrete, two ways errors of exclusion might show up are in which topics and which populations we deem worthy of studying. In some cases this means explicitly stating what or whom we think is not worthy of studying, but it often means not even realizing what or whom we’re not studying.
Some examples of topics that have been excluded over the years, for a variety of social, political, normative, and other reasons, include various diseases we deemed unimportant at various points. Early on in the HIV/AIDS epidemic, there was very little political will to put money into research on preventing or treating HIV/AIDS. Many historians and scholars since then now point to the many, many deaths that could have been prevented had “we” (funding agencies, governments, doctors, the general population) deemed this disease to be worthy of our attention. Another example is transgender health. While there is some research on the long-term effects of gender affirming therapies (including the effects of receiving the therapies and being denied the therapies), awareness of and interest in transgender health is relatively new in the scheme of human existence (for all kinds of complicated and complex reasons that we won’t get into here!), and even to this day it is widely debated whether or not it is worthy of study – or even a valid subject of study.
Particular topics being deemed uninteresting or not worthy of the cost of studying often (but not always) boils down to biases about the people or populations who would benefit from or be affected by this research. In the two (depressing) examples above, it’s now largely thought that HIV/AIDS was ignored because it was seen as a gay person’s disease, and “we” placed less value on gay people and/or considered the disease not relevant to “most” people. The same is true today with transgender care. There are also many other examples of populations being ignored in research, and they are often marginalized in some way. These groups can, have, and often still do include populations such as people who are incarcerated, unhoused, or enslaved; informal laborers; particular racial and ethnic groups; religious groups; people with disabilities; and many more.
This is not to say there isn’t hope. In the field of political science, for example, very little research was conducted until recently on the experiences specifically of Black Americans, and research that did focus on those experiences was deemed too niche or specific to be useful; graduate students would often be encouraged to expand their research to include white Americans, for example. This is changing (albeit slowly), and now we’re seeing books and papers specifically on how, for example, emotion in politics affects Black voters differently from white voters (see, e.g., Davin Phoenix’s 2019 book The Anger Gap: How Race Shapes Emotion in Politics). Of course, there is much more work to do, and it’s not a given that the field will continue to trend in this direction.
There is no right answer here. Your own reactions to this final section are likely informative about your own views on how we “should” prioritize our research, and we are not here to prescribe one topic over another. Rather, our goal is to increase awareness that the very presence of data on a topic, and the populations that are included in that data, both reflect our collective values about what and whom are worth studying. And, studying something and someone tends to further solidify our certainty that it’s the “right” thing to study.
Good heavens. What can we do about all this?
In case it’s not obvious yet, there’s no single right answer here. Generally, the things we can do to deal with errors in data are:
Be aware they exist. If, after reading this section, alarm bells don’t ring from now on every time someone claims that their data is “correct” or “True”, then please re-read this section.
Look for all four in a dataset. Knowing they are a potential problem is one step, but then you need to actively look for these errors. What are the potential sources of each? If you are collecting your own data, you can be mindful of this in your research design, but when you’re working with other people’s data, you’ll need to carefully inspect the actual dataset as well as the codebook to understand what’s going on.
Think through how they might affect your results. As we’ve said many times, it’s not enough to say, “this is biased” or “there are errors”. All data is biased and has errors. We must specify what the errors are, why they’re a problem, and specifically how they might bias the results. Are we overstating support for a policy? Are we underestimating the chances of someone leaving a company? Are we missing an entire group of people for whom we have reason to suspect the causal claims will operate differently? You don’t have to know the answer for sure (often we do not), but you must have some claim about why the error matters. Otherwise, we are just sitting in a room saying everything’s a mess without doing anything about it.
Be honest about the limitations of your data. As we’ve said previously in this chapter, if you’re not being transparent about how you collected your data, what’s in your dataset, and why you put it there, then you’re not doing science – you’re doing marketing and propaganda. Science progresses not because we are “correct”, but because we are honest about the shortcomings of our studies, which helps us make stronger inferences, and sets us up for future work to improve on our studies. Science is always a work in progress, and we cannot make progress if we’re not honest about our work.
Ok, I don’t know about you, but my head hurts. Let’s take a break from the existential crises that come from thinking really hard about data and put our hands and brains to work on something that is (at least at a superficial level) more tangible: coding! Specifically, coding to clean up a dataset once we’ve decided to work with it, familiarize ourselves with its contents, further evaluate its quality, and prepare it for analysis.