11.1. Stats vs. ML#

One of the best parts of being a data scientist is getting to argue with other data scientists on the internet about data science. Or, if you hate conflict like at least one of the authors of this textbook does (cough), it’s lurking on the internet while reading other people’s arguments about data science. If you’re looking to stir up a fight – or wondering where to go to find a raging one – look no further than the debate between statistics and machine learning.

Pretty much all sides of the debate are out there: some will argue that they are basically the same thing, some will argue they’re entirely different fields, some will argue that the distinction doesn’t matter, and others will argue that the distinction is at the heart of the field itself. Add a bit of discipline-specific territorialism, good old-fashioned egos, and the general chaos of the internet, and you’ll want to get your popcorn out to watch it all unfold (here is a “fun” place to start, for the darkly curious).

In this section, we want to briefly pause in our rolling out of methodologies to take a high level view on both the statistics and ML approaches we’ve covered so far. As will likely come as exactly zero shock to you if you’ve read anything else in this textbook, the “right” technique to use will depend on your data, your goals, and the underlying phenomenon you are trying to understand. (Well, this is the ideal – in reality, the “right” technique, sadly, might be the one your boss thinks sounds the most impressive for their upcoming presentation, even if it’s not exactly appropriate for your work, and one of our goals in this book is to equip you with the ability to perhaps, maybe, help them understand otherwise. But we also realize people need to keep their jobs. Sigh! But I digress.)

Statistics#

At the risk of inciting our own firestorm right here (you’ll note we do not allow comments on this textbook!), one way to think about when to use statistics vs. an ML approach is in terms of what the goals of your research are.

Goals#

We hope it’s not controversial to say that one of the big reasons we do statistics is to learn things about the world from data (bold, I know!). Specifically, we do this by formulating a mathematical representation of how we think aspects of the world that we care about (variables) fit together, including how they interact and influence each other. We express this in the form of models, which we then test against data.

On average, at least at the time of writing this book, one way to think about statistics (how’s this for caveating!) is that it is particularly well-suited for making inferences about the world in the form of explaining how things work, such as how a change in one variable might be associated with a change in some other variable (we even specifically used this language when interpreting our OLS statistical outputs).

Consider the following questions:

  • What human behavior has the biggest impact on climate change?

  • What is the relationship between education and health?

  • What policies might reduce poverty in my community?

All of these questions would likely lend themselves well to a “traditional” statistical analysis, very much including something along the lines of the linear regression we’ve seen. We’d need to do some thinking about our linearity assumptions and would need a continuous dependent variable, but in principle, it should be easy to see that we could set up a regression equation to evaluate any of these three questions, and by now you should be able to imagine how we might interpret the results (e.g., “a one-unit change in years of education is associated with a y-unit change in personal health, measured in terms of, etc.”). Ok. So far so good!
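To make the education-and-health example concrete, here is a minimal sketch using entirely simulated data (the numbers are invented for illustration, not real survey results) and plain NumPy for the OLS fit. The estimated slope is exactly the “one-unit change in education is associated with a β-unit change in health” quantity we’d interpret:

```python
import numpy as np

# Simulated data (purely illustrative, not real survey data): suppose each
# additional year of education is associated with a 0.5-point bump in a
# 0-100 health index, plus random noise.
rng = np.random.default_rng(42)
n = 500
education = rng.uniform(8, 20, size=n)                    # years of schooling
health = 40 + 0.5 * education + rng.normal(0, 5, size=n)  # health index

# OLS "by hand": solve for the intercept and slope via least squares
X = np.column_stack([np.ones(n), education])  # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, health, rcond=None)
intercept, slope = beta

print(f"intercept: {intercept:.2f}")
print(f"slope: {slope:.2f} (health-index points per year of education)")
```

Because we simulated the data with a true slope of 0.5, the estimate lands close to 0.5 – in real work, of course, we don’t know the true value, which is exactly why diagnostics and model checking matter.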

Origin story, superpowers, and weaknesses#

One difference between statistics and machine learning is that statistics has been around a lot longer – by many(!) centuries, in fact. Its origins are in mathematics, and in its early days it was even known as the mathematics of collections of observations.

Statistical analyses can generally offer us a better understanding of the relationships between the variables in our model. But there are tradeoffs: we need to specify the model ourselves, which means being careful about what type of model we’re using and whether it’s appropriate for our data; we need to correctly interpret the output and diagnostics (not trivial!); and we need to think carefully about model fit, the underlying data generating process, and, of course, the quality of the data itself.

Machine learning#

Goals#

Brace yourself: one of the big reasons we do machine learning is also so that we can learn things about the world from data! The departure from statistics, however, is that we generally use algorithms that “learn” the relationships between variables (so far, our examples effectively amount to finding patterns). Unlike statistics, we do not necessarily need to specify a model at the outset, though we do need to specify our research goals and the relevant variables, or attributes (one big difference is that in ML we get to call them exciting things like “attributes”, “features”, or “dimensions”, whereas in statistics you’ll likely encounter a lot of boring old “variables” – snooze fest!).

Machine learning is particularly well-suited for prediction. Consider the following questions:

  • What will the mean annual temperature be in NYC in 50 years?

  • How long can we expect someone with a college education to live compared to someone without one?

  • What will the poverty rates in each borough of NYC be in ten years under three different policy regimes?

These are similar to the questions in the statistics section above, but the goals are slightly different, in that we’re really interested in how something will turn out in the future, rather than the general effect of x on y. In some cases, either approach might be equally useful, or we might benefit from doing both.
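As a sketch of the prediction mindset (again with simulated data and a made-up nonlinear pattern, not real measurements), here is a tiny hand-rolled k-nearest-neighbors regressor. Rather than specifying a model equation and interpreting its coefficients, we hold out data the algorithm never saw and ask how well it predicts it:

```python
import numpy as np

# Simulated data with an invented nonlinear pattern (purely illustrative)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=300)
y = np.sin(x) + 0.1 * x + rng.normal(0, 0.2, size=300)

# Hold out the last 60 points: the ML workflow judges models on unseen data
train_x, test_x = x[:240], x[240:]
train_y, test_y = y[:240], y[240:]

def knn_predict(query, k=5):
    """Predict by averaging the k nearest training points -- no model equation."""
    nearest = np.argsort(np.abs(train_x - query))[:k]
    return train_y[nearest].mean()

preds = np.array([knn_predict(q) for q in test_x])
rmse = np.sqrt(np.mean((preds - test_y) ** 2))
print(f"held-out RMSE: {rmse:.3f}")
```

The point isn’t this particular algorithm but the workflow: success is measured by predictive error on held-out data, not by what fitted parameters tell us about how x relates to y.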

Origin story, superpowers, and weaknesses#

Machine learning has its roots in computer science and artificial intelligence; the term itself dates back to 1959, coined at IBM, though the field really took off in the 1990s and has kept growing since. It’s based on a deep idea: that computers can “learn” from data.

An advantage of machine learning, in addition to its emphasis on prediction, is that we don’t necessarily need prior assumptions about the relationships between the variables (though we usually do need to decide what variables to consider in the first place, as well as what’s worth studying). The tradeoffs are that, first, we generally need a lot more data to make these predictions than we do to test (some) reasonably robust statistical models. We also should, as with statistics, think very carefully about where our data comes from, how it was measured, the underlying distributions, and so on – though, as with statistics, this can be (dangerously!) overlooked.

The bigger tradeoff with machine learning is that, especially as our models get more complicated – with more variables and more sophisticated algorithms – we run into “black box AI”: we get predictions (or text outputs from ChatGPT) that might even be really good, but we don’t really know why we’re getting them.

It’s very much an open discussion as to whether we necessarily need to know why or how our program is arriving at its predictions. After all, if we just want to know what to invest in in the stock market, we might not really care. On the other hand, if we are using an algorithm to predict who is likely to be a successful performer at our company while hiring, we might really want to know how it’s working – both for ideas about what to look for, and to understand how biases in the data and the algorithm itself might be operating against particular groups (and to preview the ethics chapter to come: just as we saw in chapter 6 that all data is biased, so, too, are all algorithms).

Our powers combined#

While the debate between and about these two approaches to better understanding the world using data will rage on, we suggest that the “best” or “right” approach is going to depend on (say it with us!) the phenomena we’re trying to study, our data itself, and what we’re trying to do – as well as what’s available to us in our skill set. In addition, here at Data Science for Everyone we like to think of data science as a kind of bridge between the worlds of statistics and machine learning. Experts in ML can help us build algorithms to make predictions we hope are helpful, and statisticians can help us think more precisely about the underlying relationships between the variables. Indeed, as we said at the very beginning of this book, our goals in data science are to use data and science to better understand the world through observation, inferences, and prediction, so we will take the more conciliatory approach that the two approaches are both distinct and overlapping, and ultimately complementary.

Come at us, internet ;).