Data science is a science and an art
1.4. Data science is a science and an art#
You are going to hear this phrase a lot this semester! While it’s tempting to think of data science as a rigid practice where there’s definitely one clear right answer amidst lots of wrong ones, alas, that’s not the case. Data science certainly involves a commitment to precision and scientific rigor, but in practice it also requires a lot of creativity, substantive expertise, judgement calls, and even old-fashioned gut instincts.
Data science as a science#
It should not come as a surprise that data science is a science. That said, when people think of data science, they often focus on the “data” side of it over the “science” part. As we’ll see in the next chapter on thinking like a scientist, the science part of data science is crucial for making helpful, useful, and (hopefully) accurate inferences about the world.
For example, some especially scientific aspects of data science include that we need to:
Evaluate the merits of a testable hypothesis (based on a scientific theory) using empirical evidence that has been subjected to rigorous testing
Be precise about how we drew conclusions from our observations and how confident we are in those results, usually using statistics to express that confidence and precision
Build and evaluate huge datasets, including measuring variables of interest, collecting them with (often) computationally heavy techniques (such as downloading lots of articles from news outlets or processing climate data from a government data hub), storing them efficiently, and preparing them for analysis
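To make the second point above a little more concrete, here is a minimal sketch of one common way to express confidence with statistics: reporting an average together with an approximate 95% confidence interval. The function name `mean_with_ci` and the temperature values are made up for illustration, and the normal approximation (the multiplier 1.96) is an assumption that is rough for a sample this small; a t-distribution would usually be the more careful choice.

```python
import math
import statistics


def mean_with_ci(values, z=1.96):
    """Return (mean, lower, upper) using a normal-approximation interval.

    z=1.96 corresponds to roughly 95% confidence under that approximation.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = statistics.fmean(values)
    # Standard error of the mean: sample standard deviation / sqrt(n)
    sem = statistics.stdev(values) / math.sqrt(n)
    margin = z * sem
    return mean, mean - margin, mean + margin


# Hypothetical daily high temperatures (degrees Celsius), invented for this example
daily_high_temps_c = [21.5, 23.1, 19.8, 22.4, 24.0, 20.7, 22.9, 21.2]

mean, lower, upper = mean_with_ci(daily_high_temps_c)
print(f"mean = {mean:.2f}, approximate 95% CI = [{lower:.2f}, {upper:.2f}]")
```

The point is less the specific formula than the habit: instead of reporting a single number, we report the number along with how uncertain we are about it.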
Data science as an art#
As rigorous as we aim to be when we do data science, the real world – from which we collect all data – is messy, which means there’s no perfect dataset, there’s rarely one obvious “correct” type of analysis, and there are often many ways to translate results back into the complexities of reality. This means we need to think like artists when we do data science. For example, we need to:
Translate real-world phenomena, problems, and questions into data science problems (note: this is not straightforward and is often one of the hardest and most often overlooked parts of doing data science “in the wild”)
Develop intuition using principles for “good” vs. “bad” data or models
Think creatively and simply when deciding what to measure, how to measure it, and how to use it in our analyses
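As a small illustration of why there is rarely one obvious “correct” analysis, the sketch below summarizes the same dataset in two reasonable ways and gets two quite different answers. The income values are invented for this example, and the choice of mean versus median is just one of many judgement calls an analyst might face.

```python
import statistics

# Hypothetical household incomes (in thousands), invented for this example.
# One very large value makes the distribution skewed.
incomes = [32, 35, 38, 41, 44, 47, 52, 58, 65, 400]

mean_income = statistics.fmean(incomes)     # pulled upward by the single large value
median_income = statistics.median(incomes)  # the middle of the sorted values

print(f"mean income:   {mean_income:.1f}")
print(f"median income: {median_income:.1f}")
```

Neither summary is “wrong”. Which one is more useful depends on the question being asked, and that is exactly the kind of decision that takes judgement rather than a formula.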
Misconceptions about data and science#
Listen in on a conversation about data and/or data science and you’re likely to hear things like, “What does the data say?” and, “The data speaks for itself.” We will be (much) more specific about this in coming chapters, but we want to highlight here that one goal of this book – if literally nothing else – is to rid the world of these two phrases, as well as the broader misconception that data is equivalent to “Truth.” Rather, data is information that humans have decided is worth collecting, have collected in a particular way using human-designed tools, and have decided to analyze according to our ideas about what might be most interesting or valuable. In short, data doesn’t say anything; humans say things about data.
Another misconception is that science is “correct” – in that, if it’s in a study it must be “right”, and that it’s sufficient to defend an argument by saying it’s “science”. As we will see in the next few chapters in particular, science is iterative, always tentative, and necessarily incomplete. In fact, it’s our humility about our results and the limitations of our research – and our willingness to say we are wrong; nay, seek out why we are wrong – that makes science so powerful.
None of this, however, is license to ignore data or research simply because you disagree with the conclusions. Rather, we acknowledge the limitations of data and the tentativeness of science in order to strengthen both. Science is a process to which a community of practitioners adheres, and one in which we are transparent about the steps we have taken in our analysis and are the first to acknowledge its shortcomings. If you think a study is problematic, it’s not enough to say it’s “wrong” or “biased” on its own. You must be able to clarify what the study is missing or doing poorly and how these missteps are influencing the results. From there, we can do more and better studies going forward.
Let us be clear: data, science, and data science are incredibly powerful and can absolutely help us better understand the underlying reality out there. But the absolute best and most reliable way to do this is to approach our work with humility about our limitations. That said, we also recommend a dash of fearlessness and willingness to roll up your sleeves and give it a try, which you’ll have plenty of opportunities to do as we go.
If, as you near the end of this book and explore a dataset of your choosing on your own, you feel unsure what the “right” next move is, which analyses to carry out, or what conclusions to draw – but you have ideas about what might work, are curious and brave enough to try them, and are humble enough to seek feedback and update accordingly – then you’re absolutely on the right track.
Conclusion#
Data science is a popular field – and rightly so, as it’s powerful, exciting, and potentially (and in many cases already) transformative. But there are a lot of misconceptions about what data science is, who can be a data scientist, and what data science can do. Throughout this book, we will elaborate on the themes introduced in this chapter: data science is an interdisciplinary field that requires a lot more than just programming; it’s a science and an art; and everyone can (and should!) participate in data science.