6. Working with data

For all the excitement around data science in the world today, relatively little attention is paid to the data itself. That’s not to say there aren’t lots of people working on and thinking about the quality of data. Indeed, in this chapter we’ll learn about a number of established diagnostics for evaluating data, all of which are the result of much careful work. But relatively speaking, when we think about the promise of data science and the exciting developments in the field, we tend to spend more time on the computational power and algorithms behind data science than on the data itself.

One of the opening principles in this book was that data is not “True” and that data doesn’t, on its own, say anything or “speak for itself.” In this chapter, we will look more closely at some of the big reasons why this is the case, and we will build skills for evaluating any dataset that comes our way.

There’s a saying in statistics that goes, “garbage in, garbage out,” meaning no amount of statistical, modeling, or mathematical gymnastics can make up for bad data (and we’ll be specific very soon about what makes data “bad”). The same is true for the machine learning and other data science techniques we’ll learn in later chapters. We can certainly build or adjust statistical and machine learning models so that they are more appropriate for different types of data. But generally, one of the most powerful things we can do as data scientists is to make sure we fully understand the datasets we’re working with, be thoughtful about their shortcomings, and carry our awareness of those shortcomings into the inferences we make when we apply our work back to the real world.