4.7. Error probabilities#

Above, we made the point that testing hypotheses is about binary decisions. We reject the null hypothesis, or we fail to reject the null hypothesis. Of course, we never know with certainty whether the null hypothesis holds in the population or not: the best we can do is use our sample, and some of the techniques we’ve learned, to try and make an inference about it. But even with large samples, we will not be “sure”.

This implies that we could make an “error”: for example, deciding to reject the null when we should, in fact, have failed to reject it because it was true (which, again, we cannot observe for sure). In fact, there are four possibilities:

|                     | Null is true      | Null is false     |
|---------------------|-------------------|-------------------|
| Fail to reject null | correct inference | Type II error     |
| Reject null         | Type I error      | correct inference |

This is an important table, so let’s run through what it means. First, the “easy” cells:

  • top left: the null hypothesis is true, and we fail to reject the null. That’s correct: if there is, in fact, no relationship between these variables in the population, we should not reject the null (of no relationship).

  • bottom right: the null hypothesis is false, and we reject the null. That’s correct: if the null is in fact false, then there is a relationship between these variables in the population, and we should reject the null hypothesis (and presumably claim we have evidence consistent with the alternative hypothesis).

Now the errors:

  • top right: the null hypothesis is false, but we fail to reject it. This is saying that there is a real (in the population) relationship between the variables, but we concluded from our sample that there was not. This is called a type II error (said “type 2 error”, or “error of the second type”).

  • bottom left: the null hypothesis is true, but we reject it. This is saying that there is, in fact, no relationship between the variables in the population, but we looked at our sample and concluded that there was a relationship in the population. This is called a type I error (said “type 1 error”, or “error of the first type”).

The fact that type I errors are so named suggests that they are deemed very important or very concerning. And this is the right intuition; in general, we do not like situations in which we claim there is a “real” relationship between variables when, in fact, that relationship does not exist. One way to think about this is in terms of least harm: we might be very wary of saying that a drug works for a particular disease when, in fact, it does not. But more broadly, worrying about type I errors is consistent with the ‘skeptical’ position that statistics tends to privilege. That is: the world is messy, knowledge is precious, and we are always concerned that we might be drawing overly optimistic conclusions about our (alternative) hypothesis. More crudely, data scientists generally prefer to claim “no evidence” of some treatment effect, rather than making big claims that turn out to be false.

How often will we make the errors in the table? Well, the type I error rate is \(\alpha\), the cutoff for declaring “statistical significance” that we met above. Suppose, as is often the case, that it is set to 0.05. This means that, out of 100 repetitions of an experiment, using our decision rules and the techniques they are attached to, we are prepared to make a type I error 5 times (on average), or \(\frac{1}{20}\) of the time. If we set \(\alpha\) to 0.01, we are saying we are prepared to make a type I error 1% of the time.
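To see this claim in action, here is a minimal simulation sketch (not from the course materials; the two-sample t-test and the group sizes are illustrative assumptions). We repeatedly draw two samples from the same population, so the null of “no difference” is true by construction, and count how often the test rejects at \(\alpha = 0.05\):

```python
# Minimal sketch: how often do we wrongly reject a TRUE null at alpha = 0.05?
# (Illustrative setup; not taken from the course materials.)
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05
n_repetitions = 10_000
false_rejections = 0

for _ in range(n_repetitions):
    # Both groups come from the SAME distribution, so the null is true
    group_a = rng.normal(loc=0, scale=1, size=50)
    group_b = rng.normal(loc=0, scale=1, size=50)
    _, p_value = stats.ttest_ind(group_a, group_b)
    if p_value < alpha:
        false_rejections += 1  # a type I error

print(f"Type I error rate: {false_rejections / n_repetitions:.3f}")  # close to 0.05
```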

If making type I errors is “bad”, why not simply set \(\alpha\) to 0? Because at zero we would never have evidence that any relationship holds: a p-value can never be below zero, so requiring \(p < 0\) amounts to never rejecting the null. A broader point here is that all statistical testing involves some implied balance between potentially making mistakes and potentially never having a finding (other than the null). What, exactly, the correct balance should be is widely debated.

A final point here is that there are particular design decisions (such as the sample size we collect) that control our type II error rate (connected to something called the “power” of a test), but those are slightly beyond the scope of this course.
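Although power is beyond our scope, a small extension of the simulation above gives the flavor of the balance (this is a sketch under assumed values: the “true” effect size of 0.4 and the group sizes of 50 are invented for illustration). When the null is false, lowering \(\alpha\) makes type I errors rarer in general, but it also makes us miss this real effect more often, that is, it raises the type II error rate:

```python
# Sketch of the alpha / power trade-off under an ASSUMED true effect.
# The effect size (0.4) and group sizes (50) are illustrative choices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_repetitions = 5_000

for alpha in (0.10, 0.05, 0.01):
    rejections = 0
    for _ in range(n_repetitions):
        group_a = rng.normal(loc=0.0, scale=1, size=50)
        group_b = rng.normal(loc=0.4, scale=1, size=50)  # the null is false here
        _, p_value = stats.ttest_ind(group_a, group_b)
        if p_value < alpha:
            rejections += 1
    power = rejections / n_repetitions  # P(reject | null is false)
    print(f"alpha={alpha:.2f}  power={power:.2f}  type II error rate={1 - power:.2f}")
```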

Replication and p-hacking#

Above, we said:

out of 100 repetitions of an experiment, using our decision rules and the techniques they are attached to, we are prepared to make a type I error 5 times (on average), or \(\frac{1}{20}\) of the time

Suppose that a treatment has no “real” effect on an outcome: say, a drug does nothing for a particular disease. Suppose also that 100 different university teams or private labs are doing experiments on that drug. If they all impose an \(\alpha\) of 0.05, then, on average, about five of those teams will find that the treatment has a statistically significant effect just by chance. That is, they got an unusual sample of humans (or whatever), and it appears to them that the drug is genuinely efficacious. But, in fact, those teams just made type I errors.

If all 100 teams are compelled to publish their results, then roughly 95 of the teams will report null results (no effect) and about 5 will report positive results. If we read all 100 studies, it should temper any hope we have that this drug works.
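Here is a rough sketch of that thought experiment in code (the group sizes and the use of a t-test are illustrative assumptions, not part of the original example): the drug has no effect, each of the 100 teams runs its own experiment at \(\alpha = 0.05\), and we count how many teams nonetheless end up with a “significant” result.

```python
# Sketch of the "100 labs" thought experiment: the drug does NOTHING,
# yet roughly 5 of 100 teams will see p < 0.05 by chance alone.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
alpha = 0.05
n_teams = 100

significant_teams = 0
for _ in range(n_teams):
    treated = rng.normal(loc=0, scale=1, size=40)  # drug has no effect...
    control = rng.normal(loc=0, scale=1, size=40)  # ...so both groups look alike
    _, p_value = stats.ttest_ind(treated, control)
    if p_value < alpha:
        significant_teams += 1  # a false positive / type I error

print(f"{significant_teams} of {n_teams} teams report a significant effect")
print(f"{n_teams - significant_teams} of {n_teams} teams report a null result")
# Under publication bias, only the first (much smaller) group appears in journals.
```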

But now suppose there is a publication bias, that is

that studies which find non-zero effects of treatments are much more likely to be published than those that find zero effects.

How could this happen? Well, perhaps researchers who find null effects think they aren’t interesting and put them in their “file drawer” to be forgotten, as they move on to try something else. Or perhaps researchers send their findings (including the null results) to scholarly journals, but those journals aren’t interested in publishing null results. This may be because editors or reviewers think such results aren’t particularly interesting: they don’t care about a drug that doesn’t work.

But the consequences of this pattern are very concerning: now, the only results that are published are type I errors (though no one publishing the results knows this). These are sometimes called “false positives” insofar as we think the drug does something (“positive”), but it doesn’t (so the positive is “false”).

Think about what this means for a researcher reading the journals: they only see accounts of the drug working, not the 95 times it did not. If they try to replicate the study with the drug (meaning they follow the same protocol, using the same drug, the same sample size of subjects, etc.), that replication will very likely result in a null: with \(\alpha = 0.05\) and no true effect, about 95% of replications will fail to reject. But this is because there is, to reiterate, no effect of the drug in reality.

One can imagine that if this practice of publication bias goes on at a large scale, we could have a replication crisis. That is, journals could be full of results that were all type I errors originally, and that simply do not hold up when other researchers attempt to repeat those experiments. And recent work looking at large numbers of studies, including “famous” ones, suggests considerable problems on this front. See, for example,

Camerer, Colin F., et al. “Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015.” Nature Human Behaviour 2.9 (2018): 637-644.

Notice that to get these concerning results, researchers need not do anything “wrong” or have bad intentions. They just need to be unlucky (or lucky) and part of a publication system that overly rewards non-null findings. Things can also go wrong, and perhaps be worse, when researchers play an active role in potentially misleading the field. One concern is p-hacking, which occurs

when researchers collect, manipulate or test data until they have a statistically significant result

For example, a researcher might gather data and run a one-tailed test as above. Perhaps this doesn’t yield a statistically significant result, so they draw a new sample, or recode a sample slightly in line with another part of their theory about NYC districts. Now the result is statistically significant, so they write it up and submit it to a journal. One can easily imagine why this too would have bad consequences for the replicability of findings.
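To see why, here is a small simulation sketch of that kind of behaviour (the “up to 10 fresh samples” rule and the two-group setup are invented for illustration): the treatment still has no effect, but a hypothetical researcher who keeps redrawing samples and stops at the first \(p < 0.05\) “finds” an effect far more often than 5% of the time.

```python
# Sketch of p-hacking by resampling: stop at the first "significant" result.
# The treatment has NO effect; the 10-attempt limit is an illustrative choice.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
alpha = 0.05
n_researchers = 2_000
max_attempts = 10

hacked_findings = 0
for _ in range(n_researchers):
    for _ in range(max_attempts):
        treated = rng.normal(loc=0, scale=1, size=40)
        control = rng.normal(loc=0, scale=1, size=40)
        _, p_value = stats.ttest_ind(treated, control)
        if p_value < alpha:
            hacked_findings += 1
            break  # report the "significant" draw and stop

print(f"Share of researchers reporting an effect: {hacked_findings / n_researchers:.2f}")
# Roughly 1 - 0.95**10 (about 0.40), far above the nominal 5%.
```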