3. Programming 1

While data science is not only about computer programming, programming plays a central role in it. It is the primary tool through which we will carry out our analyses, and it is also one of the engines of the field itself: advances in computer science and computer programming are at the heart of our ability to collect new (often enormous) datasets, store and manipulate those datasets, and use that data to understand and predict outcomes in the world. In the second half of this book we will use programming to build machine learning models that make predictions using data, but there’s a lot of valuable research we can conduct, and discoveries we can make, with much simpler models (and simpler code).

In this chapter we’ll first briefly introduce you to what a program is, offer a broad lens on how programming works, and provide a bit of background on the Python programming language and why we have chosen to use it in this book rather than other languages.

That said, this is not exclusively a programming book or course, so we’ll keep all of this rather high-level. From there, we’ll get right into essential programming concepts that will provide the backbone for our data science coding to come:

  • Building blocks, including arithmetic, naming, and built-in functions

  • Data types, including individual values and sequences

  • Libraries, which allow us to bring in additional functions for more advanced data science techniques

  • Tables, which will be the primary format by which we store and manipulate data
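As a rough preview of these four ideas, here is a minimal sketch in Python. Every name and value in it is made up for illustration. The table is approximated with a plain dictionary of columns, which is an assumption on our part; the table tools used later in the book may look quite different.

```python
import math  # a library: brings in functions beyond Python's built-ins

# Building blocks: arithmetic, naming, and a built-in function
hours_per_week = 24 * 7        # arithmetic, with the result given a name
rounded = round(10 / 3, 2)     # round() is a built-in function

# Data types: an individual value and a sequence of values
course = "data science"                # a single text value (a string)
quiz_scores = [8, 9, 10]               # a sequence (a list) of numbers
average_score = sum(quiz_scores) / len(quiz_scores)

# Libraries: math.sqrt comes from the imported math library
root = math.sqrt(16)

# Tables (illustrative stand-in): each key is a column name,
# and each list holds that column's values, one per row
fruit_table = {
    "fruit": ["apple", "banana", "cherry"],
    "count": [3, 5, 12],
}
total_fruit = sum(fruit_table["count"])

print(hours_per_week, rounded, average_score, root, total_fruit)
```

Each of these pieces is covered in more detail in the sections that follow, so there is no need to understand every line yet.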

As with many things in life, one of the best ways to get better at programming is to do it. We’ll offer example code in this chapter, and we encourage you to open a Jupyter notebook (or another preferred interface, if you have one) and practice the code alongside us. This will sound pedantic, but especially at these early stages, we encourage you to actually type out rather than copy/paste the code. In our experience, actually writing rather than just reading code is very helpful for developing intuition for what’s going on in each line.

We also strongly recommend that you experiment with the code as you go. Wondering what happens if you, say, remove a parenthesis from a line of code? Want to know if a particular technique works for words as well as numbers? Try it out! While of course we will teach you the fundamentals, principles, rules, and even aesthetics (art alert!) of “good” coding, we will regularly emphasize the importance of:

  • Conducting your own trial-and-error tests for debugging and troubleshooting your programs

  • Consulting formal and informal Python documentation to understand how to do something and how it works

  • Learning to write code on your own, including from scratch, but also (far more often) by cobbling together snippets of code created by others (with attribution, where appropriate)
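The kind of experiment suggested above (trying words instead of numbers, or dropping a parenthesis) might look something like the sketch below. It is only one way to run such a test. The variable names are hypothetical, and the `try`/`except` blocks and the built-in `compile` function are used here just so the whole experiment can run in one go without stopping at the first error.

```python
# Experiment 1: does + work for words as well as numbers?
number_sum = 2 + 3               # adds the numbers
word_sum = "data" + "science"    # joins the words end to end

# Experiment 2: what if we mix a word and a number?
try:
    mixed = "data" + 3
    mixed_result = "it worked"
except TypeError as error:
    # Python refuses to add text and a number together
    mixed_result = type(error).__name__

# Experiment 3: what if a closing parenthesis is missing?
# The broken line is stored as text and checked with compile(),
# so that the error can be caught instead of halting the program.
broken_line = "print(2 + 3"
try:
    compile(broken_line, "<experiment>", "exec")
    paren_result = "it worked"
except SyntaxError:
    paren_result = "SyntaxError"

print(number_sum, word_sum, mixed_result, paren_result)
```

In a notebook you could also just type the broken line directly and read the error message Python gives back; reading error messages is a skill worth practicing early.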

In our own experience as coders, one of the most useful skills to learn when coding is how to learn how to code.

This is something of a controversial perspective among our students who simply want to be told what to type, but being able to figure out how to use, adapt, and/or generate novel code is essential for success in data science. Whether you go into industry, research, government, or wherever, you’ll always encounter new datasets, new types of problems, and new techniques. Being able to grapple with all of these on your own will set you apart from other data scientists.

Our goal by the end of this book is to equip you not just to write, execute, debug, and understand the code we give you (and we’ll give you lots!), but also to go out and find, adapt, and/or figure out how to write new code to address data science challenges and questions you’ll encounter during the rest of your life.