3.1. What is programming?

As with data science more generally, there are lots of ideas, definitions, perceptions, and even stereotypes about what programming is. For our purposes, a computer program is simply a set of instructions telling a computer what to do in order to accomplish something that we, the data scientists, wish to do. And programming, or coding, is the act of designing, writing, executing (and then almost always fixing over and over, or “de-bugging”) those instructions.

Already, this language highlights the human agency (and potential subjectivity, biases, instincts, and so on) in programming: we don’t (usually) just sit and write code for the sake of it (unless we’re practicing). We write code that carries out specific analyses that we think are worth conducting, using data we think is worth studying – and perhaps even collected ourselves. We also use intuition, creativity, problem-solving, and often a sense of aesthetics when we figure out how to design and implement our code. As with data science, there’s rarely one absolutely correct way to carry out a particular function or write out a program (though many people have very strong opinions about which way they think is best).

In other words, programming is also (you guessed it!) an art and a science.

Computer programs

We stated above that computer programs can be usefully thought of as sets of instructions telling the computer what to do. We can also think of them as a way of telling the computer how to solve a particular problem. This usually entails defining a number of smaller problems first and telling the computer how to solve those.

For example, if our problem is “create a video game”, we might divide that problem into a number of sub-problems, such as:

  • “have an environment in which a character can operate”

  • “have a character that we see in third person”

  • “make the character move according to buttons that are pressed”

  • “make the character do something when it runs into a barrier in the environment”

  • “make the background move as the character travels through the environment”

  • “have an awesome dragon (or is it dinosaur?) named Yoshi who can eat stuff and exhale fireballs”

… and so on. While there are best practices – as well as, often, industry or field-specific norms – around how problems are broken into sub-problems, you can see that it’s really up to the programmer how to divide a big problem into sub-problems, as well as how to solve those sub-problems. In data science, defining the problem itself usually means determining what kind of analysis we want to conduct and why.

Programs as recipes

A common analogy to build intuition around programming is to think of programs as recipes. For example, we can think of a broad problem we want to solve as “satisfy my hunger”. That’s pretty vague, so we might need to make a choice, such as “make lasagna”. (A parallel example in data science is translating the problem of “solve world hunger” into something we can actually study with data, such as “understand the effect of X policy change on hunger”, though even that requires more specificity, which we’ll get to in coming chapters.)

We can divide our problem of “make lasagna” into sub-problems:

  • “chop up all the vegetables”

  • “make a sauce”

  • “do something with the noodles”

  • “assemble the vegetables, sauce, and noodles in a pan”

  • “bake it”

As you can see, one of the authors of this textbook not only does not know how to cook, but is also vegan, and so is likely making everyone else reading this extremely mad about the ingredients I’m leaving out because I don’t know how to prepare them.

My own personal limitations aside, there are of course many different ways we could divide up this problem. In fact, to start to get an intuition for how to “architect” a program, we encourage you to take a few moments right now to think about – and even jot down – the sub-problems you would use instead. Feel free also to bring in your lasagna to share with the class.

(By the way, writing out code by hand on a piece of paper or, say, a whiteboard or chalkboard, is (creatively) called “whiteboarding” and is a common way researchers design programs – and especially how teams collaborate on big projects – so is a good practice, even if it’s not “coding” per se.)

Computers need you to spell everything out

When we write programs, we need to be a lot more specific than when we write recipes for humans. In the example above, even my terrible excuse for a lasagna “recipe” is vaguely useful – you still have some ideas about the supplies you’ll need and what to do with them, even if we’re lacking some helpful details. It’s implied in any recipe, in fact, that you’ve, e.g., gone to the store to get the ingredients, that you have a refrigerator, that you know what a stove is, and that you know how to chop things.

For computer programs, we need to be much more granular. Take the sub-problem of “chop vegetables”: in order for the computer to do this, we need to define what vegetables are and put them somewhere we can chop them. We also need to establish a knife, as well as supply instructions for how to use a knife to cut things up. These are all sub-sub-problems, and they might all appear inside one “chop vegetables” program, or even be smaller programs of their own (back to aesthetics as well as functionality).
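As a small preview of what this granularity can look like in Python, here is a minimal sketch of a “chop vegetables” program. Everything in it is made up for illustration: the names `cutting_board`, `knife`, and `chop_vegetables`, the vegetables themselves, and the idea that “chopping” means splitting something into four labeled pieces. It is one of many ways these sub-sub-problems could be written.

```python
# Sub-sub-problem 1: define what the vegetables are and put them
# somewhere we can chop them. (Hypothetical names and values.)
cutting_board = ["carrot", "zucchini", "onion"]


# Sub-sub-problem 2: establish a knife, including instructions for
# how it cuts a single vegetable into pieces.
def knife(vegetable, pieces=4):
    """Cut one vegetable into a given number of labeled pieces."""
    return [f"{vegetable} piece {i + 1}" for i in range(pieces)]


# Sub-sub-problem 3: use the knife on every vegetable on the board.
def chop_vegetables(vegetables):
    """Chop each vegetable and collect all of the pieces in one list."""
    chopped = []
    for vegetable in vegetables:
        chopped.extend(knife(vegetable))
    return chopped


chopped = chop_vegetables(cutting_board)
print(len(chopped))  # 3 vegetables x 4 pieces each = 12
print(chopped[0])    # carrot piece 1
```

Here the knife is its own small program that the larger “chop vegetables” program uses. We could just as well have written the cutting instructions directly inside `chop_vegetables`, which is exactly the kind of design choice described above.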

As we’ll see when we start to code, there are multiple ways to then carry out these instructions. But first let’s get some intuition for our options:

The one most people think of when they think about coding is simply writing out a program from scratch just like you might write a paragraph in a document. While this is absolutely doable, it can be quite tedious, as we’re reinventing the wheel (or rebuilding vegetables) each time.

Another common approach is to import existing external programs – we call them packages or libraries, or sometimes simply functions – that carry out one or more of our sub-problems. In our recipe program, we might do something like: “import vegetables”, “import knife”, “import cutting motion” and then assemble those items. (Of course, this presumes an external program called “vegetables” already exists – for the most part in this course we’ll be using lots of existing programs, but we will also learn how to write our own.)
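Since no “vegetables” package actually exists, here is a rough sketch of the same two options using something real: computing an average. The first version writes the instructions out from scratch; the second imports `mean` from `statistics`, a module that ships with Python. The `cooking_times` numbers and the `average` function are invented for illustration.

```python
# Option 2 relies on this import: an existing program that already
# knows how to compute an average.
from statistics import mean

# Invented numbers, purely for illustration.
cooking_times = [30, 45, 60]


# Option 1: write out the instructions from scratch.
def average(values):
    """Add up all the values, then divide by how many there are."""
    total = 0
    for value in values:
        total = total + value
    return total / len(values)


print(average(cooking_times))  # our own from-scratch version
print(mean(cooking_times))     # the imported version
```

Both lines print the same average. The imported version takes one line to use because someone else has already written (and tested) the instructions for us.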

A data science (toy) example

To get out of lasagna world and back to data science, consider the following simple example. Suppose we want to analyze whether the unemployment rate in the United States has increased in the past year. To write a program to do that, we might first break it into sub-problems:

  • import data on employment

  • clean the data for analysis (more to come on what that looks like)

  • define a plot with an x-axis and a y-axis

  • assign “time” to the x-axis and “unemployment” to the y-axis

  • specify any plot colors, titles, or other designs

  • generate the plot

As with our recipe example, in order for this program to actually run with code, we could write out all of it from scratch: we’d type the unemployment data we found manually into our coding interface, we’d construct axes, and so on. A simpler way (and the way we’ll mostly do it) is to import an existing spreadsheet of data, import a package that has built-in functionalities we can use to manipulate our data, and import another package that has existing functionality to create and edit plots.
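To make the first few sub-problems a little more concrete, here is a minimal sketch that uses only tools that ship with Python (`csv` and `io`). It is not the approach we’ll mostly use in this course. The “spreadsheet” is a short piece of text standing in for a real file, and its numbers are invented placeholders, not real unemployment figures. The sketch stops before the plotting steps, since those rely on an external plotting package; it only imports the data, cleans it, and lines up the values for each axis.

```python
import csv
import io

# Placeholder data standing in for a real spreadsheet. These numbers are
# invented for illustration and are NOT real unemployment figures.
raw_spreadsheet = """month,unemployment_rate
2000-01, 5.0
2000-02, 5.2
2000-03,
2000-04, 5.4
"""

# Sub-problem: import the data (read the text as rows of a table).
rows = list(csv.DictReader(io.StringIO(raw_spreadsheet)))

# Sub-problem: clean the data. Here that means dropping any row with a
# missing rate and converting the remaining rates from text to numbers.
clean_rows = [
    {"month": row["month"].strip(), "rate": float(row["unemployment_rate"])}
    for row in rows
    if row["unemployment_rate"].strip()
]

# Sub-problem: assign "time" to the x-axis and "unemployment" to the y-axis.
x_values = [row["month"] for row in clean_rows]
y_values = [row["rate"] for row in clean_rows]

# A first, rough look at the question: is the last rate higher than the first?
change = y_values[-1] - y_values[0]
print(x_values)
print(y_values)
print("increased" if change > 0 else "did not increase")
```

Even this tiny example involves choices: we decided that “cleaning” means dropping the month with a missing value, and that “increased” means comparing the first and last months. A different data scientist might reasonably decide otherwise.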

All of this is very abstract so far, but having an intuition for how to think like a data scientist when it comes to programming will help a lot in the long run.

But enough pretend programs. Let’s talk about Python and why we’re using it, and then get to some coding!