13.1. Bayesian inference#

Yet another deep debate within data science and broader scientific worlds centers on how we approach statistics. Astrophysicist Edwin Turner summarizes the situation well:

*“Statistics sounds like a dry, technical subject, but it draws on deep philosophical debates about the nature of reality.”*

Two big philosophies in statistics – and, really, about the world – are frequentist (or “classical”) and Bayesian statistics. As with our stance on earlier debates discussed in this book, ours is that the two are generally complementary rather than competitive, and each can be more or less useful depending on our goals. But, of course, nothing riles some people up like neutrality, so we shall again brace ourselves for hate mail.

Regardless of your own views on the fundamental Nature of Things, Bayesian reasoning – around since Rev. Thomas Bayes introduced it in the 1700s – has become a central component of many advances in data science. An introductory book on data science such as this one would be remiss not to at least introduce the idea and build some intuition around it, so that when you encounter, say, Bayesian machine learning classifiers in the future, you’re prepared!

Frequentist vs. Bayesian#

Our book so far has focused on frequentist statistics. The fundamental (and perhaps so seemingly obvious that it doesn’t even need to be said) idea is that the probability of an outcome is the relative frequency with which it occurs. Indeed, spelling it out like this feels a bit like fish declaring that they are in water – most of us don’t even know we are in it, never mind question whether we might consider being in something else.

The practical implications of this are things like: the probability of flipping a coin and getting heads is \(\frac{1}{2}\), and the probability that a card we randomly drew from a full deck of cards is a King of Hearts is \(\frac{1}{52}\). This simple observation – that the probability that something will happen is exactly in proportion to how often it occurs – has been the backbone of all our statistics work so far.
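To build some intuition for this view, here is a minimal simulation sketch (ours, in plain Python using the standard `random` module; the trial counts are arbitrary choices): as we flip a fair coin more and more times, the relative frequency of heads settles toward \(\frac{1}{2}\).

```python
import random

# Estimate p(heads) as the relative frequency of heads over many coin flips.
for n_flips in [10, 100, 10_000, 1_000_000]:
    heads = sum(random.random() < 0.5 for _ in range(n_flips))
    print(f"{n_flips:>9,} flips: relative frequency of heads = {heads / n_flips:.4f}")
```

With only 10 flips the estimate can wander quite far from 0.5; with a million flips it will typically land very close to it.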

Bayesian reasoning, on the other hand, states that probability expresses our degree of belief in an event. This means we combine data from an experiment with our prior beliefs (held before the experiment) to form our posterior beliefs (held after it). In particular, we are interested in conditional probabilities – the probability that something is the case given some new information about that thing.

Consider the example above of pulling a card from a deck. We said that the probability that the card we pulled is a King of Hearts is \(\frac{1}{52}\). But now suppose we got some new information about that particular card (e.g., that it is red, or a face card). We can now ask: what is the probability that the card we pulled is a King of Hearts given this new information? What is the probability I pulled a King of Hearts given that the card I pulled is red, or a face card, etc.?
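A quick way to see how conditioning changes the answer is to enumerate the deck directly. Here is a minimal sketch of ours (the rank and suit labels are just illustrative names):

```python
# Enumerate a standard 52-card deck and condition on new information.
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = [(rank, suit) for rank in ranks for suit in suits]

def is_king_of_hearts(card):
    return card == ("K", "hearts")

# Unconditional: 1 favorable card out of 52.
print(sum(map(is_king_of_hearts, deck)) / len(deck))    # 0.0192... = 1/52

# Given the card is red: 1 favorable card out of 26 red cards.
red = [c for c in deck if c[1] in ("hearts", "diamonds")]
print(sum(map(is_king_of_hearts, red)) / len(red))      # 0.0385... = 1/26

# Given the card is a face card: 1 favorable card out of 12 face cards.
faces = [c for c in deck if c[0] in ("J", "Q", "K")]
print(sum(map(is_king_of_hearts, faces)) / len(faces))  # 0.0833... = 1/12
```

Each piece of new information raises the probability here because it rules out cards that could not be the King of Hearts.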

Bayesian reasoning is very powerful, especially when we have a prior belief about the likelihood of something and are then given new information about that thing. It turns out that this situation is very common in machine learning problems.

A learning example#

Bayesian reasoning centers on learning new information about the thing you care about, and then updating your beliefs about that thing. For example, suppose you meet a random person somewhere in the US. You might want to know: what is the probability they are a US citizen? (Note: not for frightening deportation reasons, but rather as an exercise in intellectual curiosity, and perhaps preparing yourself to enjoy the exposure to other cultures; but we digress.)

We know that (at the time of writing):

  • Around 329 million people live in the US

  • Around 285 million are US citizens

With simple probability, we now know that the probability that this random person we meet somewhere in the US is an American citizen is:

\[ p(\text{US citizen}) = \frac{285}{329} = 0.87 \]

Now suppose we are given new information that this person is “foreign born”, meaning they were born somewhere other than the United States and are not a natural born citizen (as they would be if, for example, their parents were US citizens). Does this new information cause you to update your estimate of the probability they are a US citizen? If so, up or down?

Our question is now what is the probability this person is a US citizen given they are foreign born? We can express this as follows, where “\(|\)” indicates “given”:

\[ p(\text{US citizen}|\text{foreign born}) \]

We can now say our question is one of conditional probability: we are conditioning our estimate on some new piece of information. So, how do we calculate this conditional probability? We turn to Bayes’ rule!

Bayes’ rule#

Bayes’ rule states:

\[ \text{posterior} = \frac{\text{prior} \times \text{likelihood}}{\text{evidence}} \]

where:

  • posterior = our belief after getting new information

  • prior = our belief before we got new information

  • likelihood = the probability of observing our new information given that the event we care about is true

  • evidence = the probability of our new information

or, more formally:

\[ p(A|B) = \frac{p(A) \times p(B|A)}{p(B)} \]

where A and B are “events”. In the case of our learning example above, A = being a US citizen, and B = being foreign born. We can map this example to the equation:

\[ p(\text{US citizen}|\text{foreign born}) = \frac{p(\text{US citizen}) \times p(\text{foreign born}|\text{US citizen})}{p(\text{foreign born})} \]

In other words, our posterior – the quantity we would like to know – is the probability that someone we meet is a US citizen given that we find out they are foreign born.

And the answer can be found by multiplying our prior (the probability someone is a US citizen) by the likelihood (of someone being foreign born given they are a US citizen), and then dividing that by the evidence (the probability that someone in the US is foreign born).

Happily, we can look up all the numbers on the right side of the equation.

  • prior: we already know from above that \(p(A) = 0.87\)

  • likelihood: around 22 million American citizens out of 285 million are foreign born, so \(p(B|A) = 22/285 = 0.077\)

  • evidence: around 44 million people in the US are foreign born, so \(p(B) = 44/329 = 0.134\)

Inputting these values into our equation gives us:

\[ p(\text{US citizen}|\text{foreign born}) = \frac{0.87 \times 0.077}{0.134} = 0.5 \]

Thus, knowing this new information about the person we met causes us to lower our estimate of the probability they are a US citizen.
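To make the arithmetic concrete, here is a minimal Python sketch of ours (the function and variable names are our own illustrative choices) that computes the same posterior directly from the population counts given above:

```python
def bayes_posterior(prior, likelihood, evidence):
    """Bayes' rule: p(A|B) = p(A) * p(B|A) / p(B)."""
    return prior * likelihood / evidence

us_population = 329e6         # people living in the US
us_citizens = 285e6           # US citizens
foreign_born_citizens = 22e6  # US citizens who are foreign born
foreign_born_total = 44e6     # everyone in the US who is foreign born

prior = us_citizens / us_population               # p(US citizen)              ~ 0.87
likelihood = foreign_born_citizens / us_citizens  # p(foreign born | citizen)  ~ 0.077
evidence = foreign_born_total / us_population     # p(foreign born)            ~ 0.134

print(bayes_posterior(prior, likelihood, evidence))  # -> 0.5 (up to floating point)
```

Written this way, you can also see why the answer comes out to exactly 0.5: the total population cancels out, leaving 22 million foreign-born citizens out of 44 million foreign-born residents.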

Subjective priors#

What does all of this have to do with making inferences about the world? For one, applying a Bayesian perspective, when it makes sense to do so, is a potentially useful way to reframe a problem. For another, we don’t always have access to the information on the right side of the equation. Or perhaps the data isn’t particularly reliable (for all the important reasons we discussed in chapter 6). Or there are multiple datasets and we disagree over which one(s) to use.

Challenges to inference arise in this case because different people may have different subjective priors: different beliefs about the world, informed by previous findings and information, or even by values and preferences.

For example, suppose we were unable to look up the probability that someone we meet in the US is an American citizen. We know it’s 0.87 because we looked it up and decided the sources we read were reliable. But what if we didn’t have time to look it up, or we didn’t trust the source, or this data wasn’t even collected at all?

In this case, we’d probably guess, and we might each guess something different: 0.7? 0.9? It’s also not difficult to imagine that the guesses we make are related to our worldviews. For example, someone who is very worried about immigration to the US might have a lower subjective prior – they might assume that fewer people in the US are Americans than is really the case. Someone who advocates for more open borders might have a higher subjective prior.
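To see how much this matters, here is a short sketch (reusing the likelihood and evidence values we looked up above, and assuming those stay fixed) that recomputes the posterior under a few different guessed priors:

```python
# How sensitive is the posterior to a guessed (subjective) prior?
likelihood = 0.077  # p(foreign born | US citizen), from the data above
evidence = 0.134    # p(foreign born), from the data above

for prior in [0.70, 0.80, 0.87, 0.90]:
    posterior = prior * likelihood / evidence
    print(f"prior = {prior:.2f} -> posterior = {posterior:.2f}")
```

A skeptic’s prior of 0.70 yields a posterior of about 0.40, while an optimist’s 0.90 yields about 0.52 – the same evidence, but noticeably different conclusions.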

Consider other examples that might be affecting policy:

  • What is the probability of a random person in the US testing positive for Covid?

  • What is the probability of a random person in the US voting illegally?

  • What is the probability of an extreme weather event in the US in the next 100 years?

If we cannot look this data up, or we do not trust the data, or the issue hasn’t even been studied, we likely have to rely on our own subjective priors as we make sense of the world – and, as we saw in the application of Bayes’ rule above, this can cause us to form incorrect conclusions about the world, even if we have the best of intentions (and we realize that’s perhaps an optimistic “if”).