3.3. Building blocks

In this section we describe three techniques that are the backbone of almost all programming we’ll do in this book, and in data science generally:

  1. Arithmetic

  2. Naming

  3. Built-in functions

If you have experience programming, you’ll likely find much of this familiar, but it’s worth making sure you understand it thoroughly, especially if you’re coming from another language. If you’re new to programming, the more comfortable you become with the concepts here, the smoother your future programming sailing will be.

Arithmetic

We’re working with Jupyter notebooks, but this code will work in any interface for Python. As stated thousands of times by now, we encourage you to type in your own Jupyter notebook as you read in order to see for yourself how each line of code works – even for simple ones.

One of the simplest things we can do when coding is use our notebook as a calculator. For example, we can do basic addition by writing an expression in a cell and pressing “Run” in the menu at the top of the screen or (and this is way faster!) typing SHIFT+RETURN on the keyboard.

12 + 2     
14

In the above cell, the code we’ve entered is simply 12 + 2 and upon running the code, the computer returns 14. Note that while spaces can matter in certain contexts, you can write arithmetic expressions with or without them. Try the cells below to see for yourself.

12+2
14
12     +      2
14

We can also carry out other forms of arithmetic:

12-2
10
12*2
24
12/2
6.0

Again, try these out on your own, and experiment with spacing to see that you can be pretty flexible about it in this context (though not in all of them!).
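You may have noticed that 12/2 came back as 6.0 rather than 6. As a small aside, here is a sketch of why: in Python 3 the / operator always produces a decimal (a "float"), even when the division comes out even. The // operator and the names below are not used elsewhere in this section and are included only for contrast.

```python
# Division with / always gives a decimal (float) result in Python 3
even_division = 12 / 2
print(even_division)     # 6.0

# For contrast only: // is "floor division", which drops the decimal part
floor_even = 12 // 2
floor_odd = 13 // 2
print(floor_even)        # 6
print(floor_odd)         # 6
```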

Specificity really does matter elsewhere, however: slight changes in the code can lead to very different results. See, for example, the cell below:

12**2
144

Here, we see that two asterisks (**) tell the computer to calculate an exponent (in this case, 12 squared) rather than a multiplication. Suppose we meant to write a single * in order to carry out multiplication. Because we are just writing one line of code, we can quickly see that the value returned (144) is not what we are expecting (24). But in more complicated calculations and/or in programs with multiple lines of code where we don’t already know what the correct answer is, we may not see immediately that our code isn’t doing what we think it is. Importantly, the code also won’t return an error. It’s up to us as researchers to be very careful about our code and to re-read it many times (just like we’d proofread a document). While an error message is a sign that we have an error, we can also have errors that don’t prevent the code from running.

Throughout, we’ll talk about best practices to minimize these “hidden” bugs, including

  • Running each line of code (or at least the smallest snippet possible for syntax that requires more than one line to run) at a time to make sure it’s doing something that looks “correct” (if you can calculate it separately) or at least “reasonable” (judgment alert!) to you

  • Where possible, cross-checking your work with two approaches to the same thing (for example, you might say, “I think 12**2 is squaring 12, but I’m not positive.” You could then calculate 12*12 to confirm that’s what’s happening.)

  • Consulting official Python documentation (https://www.python.org) or reliable resources online (more to come on how to judge “reliability”) and carefully (SO much more carefully than you think) checking every character you’ve entered to make sure it’s correct (you will be shocked how often you will do this, say, ten times, and then on the 11th pass spot a typo or other tiny error)

  • In professional circumstances, it’s common practice to ask colleagues to run your programs and look through them with fresh eyes to make sure they’re working as you expect. In this course, we do need you to work on your own for assignments, but you’re welcome (and encouraged!) to collaborate as you’re practicing coding.
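The cross-checking habit from the list above can be sketched as follows. This is only an illustration: the names squared, by_hand, and doubled are made up for this example, and == (which asks "are these two values equal?") has not been formally introduced yet.

```python
squared = 12 ** 2      # what we think squares 12
by_hand = 12 * 12      # a second, independent way to square 12
doubled = 12 * 2       # what we would get if we had meant a single *

print(squared)               # 144
print(squared == by_hand)    # True: ** really is squaring here
print(squared == doubled)    # False: ** and * are not interchangeable
```

If the two approaches disagree, that is a signal to stop and re-read the code before going any further.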

Naming

One of the most important techniques to master as early on as possible is naming. This refers to assigning a name to something so we can use it over and over without having to type it all out each time. This “something” can be lots of things, including arithmetic expressions, variables (essentially a bunch of data about a thing we care about, such as prices or temperatures, but we’ll be much more specific about this soon), entire datasets themselves, and much more.

The code for assigning a name to something is pleasantly simple, but there are some specific requirements for the order in which you both write and run code to name objects. These can be a little tricky to get the hang of and are a common source of bugs that can trip up even experienced coders from time to time.

To name something, we simply type the name we wish to use followed by = and then the expression, variable, dataset, or whatever that we want to be stored as that name. For example:

price = 12 + (12 * 0.08875) 

If you’ve spent much time in New York City, you may recognize 0.08875 as the sales tax rate for most goods purchased in the city. The formula to the right of the = sign, thus, is calculating the total price of a product that costs $12. Also, technically the ( ) are not needed, but I like to include them for aesthetics and readability (art!), and as part of a habit for more complicated expressions down the road. But, as ever, we encourage you to practice in your own notebook to see for yourself that it works either way (even if you already know it does because of how math works!).

Rather than write that whole formula out each time, we can name it price and from now on (as long as we run the code in this cell before later cells) we can simply write price wherever we like to use the value of everything to the right of the =. For example:

price
13.065

Now, there is an important difference between the first line of code, where we assigned the name price to our expression, and the second line of code where we just typed price. The first line of code did not return anything – which is to say, it did not produce any kind of result or answer to our calculation. The second line did return something, however: 13.065, which is the result of the calculation. Why did the first line not return anything, whereas the second one did?

The answer is a great example of how we need to be hyper-specific when we code, and hyper-aware of exactly what we’re telling the computer to do. In the first line of code, the computer does calculate what is to the right of =, but we have not asked it to show us the result. We have asked it to store that result under the name price. An assignment like this does not return any value; you can think of it as there being nothing to share. We haven’t asked the computer to return anything.

If the computer doesn’t return anything, how do we know the code did what we wanted it to do? This is an example of one of our anticipatory debugging best practices from the previous section. We can check our work by entering price into a new cell. Simply putting the name in the cell, provided we previously defined it, will return the value stored under that name, as we see in the second line of code above.

An even better habit to get into is writing print(price), as below:

print(price)
13.065

In this particular instance, there’s no difference in output between writing price or print(price), but with more complicated code we will see that using print is helpful, if not necessary. Generally, however, I will be the first to confess I almost always forget to write print unless it’s necessary for the code to work, so: do as I say, not as I do.
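One place the difference shows up is in a cell with more than one line: a notebook only displays the value of the last bare expression in the cell, whereas print displays something every time it is called. A minimal sketch, assuming price is defined as above (tax_amount is a made-up name for this illustration):

```python
price = 12 + (12 * 0.08875)
tax_amount = 12 * 0.08875   # hypothetical name, just for this illustration

print(tax_amount)   # print always displays its value
print(price)        # ...no matter where the line sits in the cell

tax_amount          # a bare name mid-cell: a notebook shows nothing for this line
price               # a bare name on the last line: a notebook displays this one
```

Try deleting the two print lines and re-running the cell to see that only the final value appears.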

Our price formula is pretty nifty so far, but unfortunately it’s only useful for products that cost exactly $12 in New York City at the time of writing this (very) tiny program. What if we want to make a more general price calculator that we could easily adapt to different products and locations? Luckily, we can create multiple named entities, each representing a variable in our expression, which will allow us to write a more generally useful program. For example, we can set it up as follows:

price = 12                                     #set the price
tax = 0.08875                                  #set the tax

total_price = price + (price * tax)            #formula to calculate the total price
print(total_price)                             #print our answer!
13.065

This isn’t the most beautiful program ever written, but we now have something we can work with in multiple scenarios. Suppose we now are buying something in New Jersey that costs $8. We can simply replace the 12 with 8 in price and replace 0.08875 with 0.06625 in tax (and also remind ourselves to go buy stuff in NJ whenever possible) as follows:

price = 8                                      #set the price
tax = 0.06625                                  #set the tax

total_price = price + (price * tax)            #formula to calculate the total price
print(total_price)                             #print our answer!
8.53
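One detail worth seeing for yourself: changing price does not update total_price on its own. The formula line has to run again, which is why it is repeated in the cell above. A small sketch using the same names (old_total is an illustrative name, not part of the program above):

```python
price = 12
tax = 0.08875
total_price = price + (price * tax)   # calculated while price is 12

price = 8                             # change the price...
old_total = total_price               # ...but total_price still holds the $12 result
print(old_total)

total_price = price + (price * tax)   # re-run the formula to update the total
print(total_price)
```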

In addition to us now having a fabulous tool that we can get out every time we’re shopping, there are a few other important coding lessons in here:

Commenting out with #

In the last two cells, notice that to the right of the code we’ve included comments written after a #, such as the first one, “#set the price”. This is called commenting out and is an extremely useful technique in coding, and is absolutely a best practice. Throughout our code, we can annotate anything we like by simply writing # followed by our note or description. In a simple program like the one we’ve built here, this may seem unnecessary, but it’s extremely helpful and time (and misery) saving when we build longer programs. It’s helpful because we may want to remind our future selves what we actually did, and why. It’s also a way to quickly review our program to find the part that does something specific (“I just need the code that sets the price … where is it?”). And, it’s exceptionally helpful for collaborating with others. Reading other people’s code, even in a highly readable language like Python, can be frustrating, and good documentation is a very welcome gesture. As you encounter more code snippets online, you, too, will come to appreciate the value of good documentation.

There are no hard and fast rules about what to write, or how to write it, but generally you’ll find the convention is to simply and briefly explain what each line of code is doing in your program. You also don’t have to document every line, but (generally) erring towards more is better than less, though this is also a judgment and aesthetics call. For example, many people might skip documenting a simple, well-known line such as print, but there’s no real harm in being thorough. I also tend to prefer that comments not span more lines than the code itself (unless I’m providing a longer description of an entire program at the top of a cell), but this is one hundred percent a design preference (coders are weird, eh?).

The only real “rule” about commenting out is that it needs to be after your code if it’s on the same line, or on a separate line of its own. This is because everything after the # will be read as a comment and not as instructions for the computer. As ever, try a few variations in comment placement to see for yourself.
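A quick sketch of the placement rule described above. The value 100 is arbitrary and only there to show that a line beginning with # never runs.

```python
# A comment on its own line: the whole line is ignored
price = 12        # a comment after code: only the part after the # is ignored

# price = 100     <- this entire line is a comment, so price is NOT changed
print(price)      # 12
```

Putting the # before the code, as in the third line, turns the code itself into a note, which is why comments on the same line always go after the code.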

Writing all the code in one cell vs. spreading over multiple cells

Sometimes in the above cells there is just one line of code per cell, whereas in others (such as the one immediately above), there are multiple lines. How do you know when to spread your lines of code across multiple cells, versus do it all in one cell? There generally isn’t a fixed rule for this, though different coders will have strong opinions about the aesthetics. Ultimately, as long as your code runs, it doesn’t really matter for the code we’re doing, but here are some general guidelines:

  • When first writing a program, err on the side of as few lines as possible per cell until you know exactly what it’s doing and are sure it works. This will make debugging easier and will increase your own understanding of what’s going on in your code!

  • Once the program or section of the program is written, it’s common practice to put everything that does one particular thing into one cell. For example, in the code cell above, the multiple lines together all accomplish one thing, which is to calculate the total price. It would work if it were spread across multiple cells, but there’s no real use for each cell on its own.

  • Once you’ve confirmed your code does what you think it does, putting it all into one cell can actually reduce bugs that come from running cells out of order. Which brings us to the next lesson!

The order in which your code runs is important

Jupyter notebooks are very useful and have lots of great features, especially when learning to code. One big one is that there are different cells and no need for a compiler, so you can iterate and test code quickly as you go. That said, a very common source of errors in code, especially when you’re first starting out, is running the cells out of order. Generally, you want to write your notebooks such that they flow from top to bottom. For example, in the code above we need to define price and tax before we can run our equation with those variables. If we try to run the equation before we define those terms, we will get an error.

We will see a lot more examples of this very soon. Generally, when I get an error message but think my code is right, one of the first things I do is re-run the code from top to bottom to see if I accidentally ran something out of order. (Simply having the code written isn’t enough: you need to run the cell.)

One quick way to debug errors that come from running cells out of order, as well as to confirm at the end of your work that your entire program is written in the right order, is to go up to the menu bar to Kernel > Restart and run all. This will (not surprisingly) clear all your output and run your entire notebook afresh from top to bottom.
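To see what an out-of-order error looks like without stopping your notebook, here is a small sketch. It uses try and except, which we have not covered yet and which are only here so the example keeps running after the error. The names item_cost and sales_tax are made up so that they are certainly not defined beforehand.

```python
# Using names before they are defined raises a NameError
try:
    total = item_cost + (item_cost * sales_tax)
except NameError as error:
    message = str(error)
    print("Error:", message)

# Define the names first, and the very same formula runs without complaint
item_cost = 12
sales_tax = 0.08875
total = item_cost + (item_cost * sales_tax)
print(total)
```

The formula is identical both times; the only thing that changed is what had already been run before it.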

How do we know what to name something?

As with much in coding, there aren’t too many formal rules about naming – as long as your code runs, it’s technically ok. But there are a few problems to avoid, as well as some strong conventions and best practices. Here are a few big ones:

  1. Names need to be one word. If you want to use more than one word for the name, the best practice is to connect them with a “_”. We saw this above with total_price. (As opposed to, e.g., totalprice. But, you’ll see coders break this guideline a lot!)

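  2. Names cannot be words that Python reserves for its own commands (these are called keywords). For example, import does something specific in Python, so we cannot name something import. Names also should not reuse the names of functions in Python or in our program. In the next section we’ll learn built-in functions, such as max, and while Python will let you name something max, doing so replaces the function under that name and leads to confusing bugs later.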

  3. Names should be things that are easy for you to remember and others to understand. In keeping with the broader Python readability aesthetic, generally names that describe clearly what you’re referring to will also help. For example, your life will be easier if you use names like total_price rather than t_pr or something like that, even if at first it’s quicker to type (always think of other readers and your future self coming back to the code – you want to make it as easy as possible for everyone to tell what you’re doing).

We’ll get lots more practice with this going forward, but these three points should help you start from a solid foundation!
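If you are ever unsure whether a name is allowed, you can ask Python. The sketch below is one way to check, not a required step: keyword is a standard-library module that knows Python's reserved words, and isidentifier is a check that any piece of text can run on itself.

```python
import keyword   # standard-library module that lists Python's reserved words

print(keyword.iskeyword("import"))        # True: reserved, cannot be used as a name
print(keyword.iskeyword("total_price"))   # False: free to use

print("total_price".isidentifier())       # True: counts as one "word"
print("total price".isidentifier())       # False: the space breaks it
```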

Built-in functions

Python has a number of built-in functions that perform common mathematical operations and carry out frequently used instructions. We can use them without doing anything special other than writing the code. For example:

abs(-6)         # find the absolute value of -6
6

Not surprisingly, the above code abs( ) finds the absolute value of any number you put inside the ( ). The full list of built-in functions in Python is available in the official Python documentation for built-in functions. Most of them may not mean much to you right now, but we will use many of them going forward.

While it’s certainly helpful and efficient to memorize some of these (for example abs( )) so you don’t have to look them up each time, you don’t need to memorize them, and in fact you’ll naturally memorize the ones you use more. (Another word you’ll see a lot of is import, though strictly speaking it is a Python keyword rather than a built-in function.) So, unless you love memorizing things, do not feel like you need to sit down and memorize the table of built-in functions in the Python documentation. Rather, it’s something you can consult as you go. Say you’re working on something and you think, gosh, it would be helpful to find the absolute value. Is there a built-in function I could use? Then you can look it up.

That said, note that there are other ways to find the absolute value of a number without using the built-in function abs( ). We won’t write out the code now, but you could imagine something where you give your computer an instruction to evaluate whether a number is a negative number, and if so to multiply it by -1. That would get you the same result, but in this case you’ve written the function by hand instead of using a built-in one.

This is just one example of many in coding where there are lots of different ways to get to the same answer. Generally, we’ll share the most common approaches, but we’ll also highlight other routes where relevant, and discuss why we might use one method over another.

Finally, you might be wondering how you’d go about figuring out if there is a built-in function for what you want to do. Yes, you could pull up the table each time, but another habit we encourage is trying things. For example, consider the below variable, which this time has more than one value (we’ll give this an official name in the next section):

heights = (60, 64, 58, 70)      # create heights with 4 values

Suppose we want to write a line of code to find the maximum value of heights. We could simply try out something that seems like it might work, such as:

max(heights)        # find the max!
70

Fun! Let’s see if it works for the minimum, too:

min(heights)        # find the min!
58

Ok, great. What if we want to find the mean? Alas, if we run mean(heights) we will get an error (try it in your own notebook to see for yourself). Specifically the error message will say “name 'mean' is not defined”, which means (LOL) just what it sounds like – that there is no defined instruction mean for the computer to follow. Thus, in this case there is no built-in function called mean (well, there could be a function that calculates the mean but is just called something else, which we could check in the table – but in this case I will just tell you there isn’t!).
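If you would rather ask Python than scan the table, one option (a sketch, not the only way) is the standard-library builtins module, which holds every built-in function. The hasattr function simply asks whether a given name exists there.

```python
import builtins   # standard-library module containing Python's built-ins

print(hasattr(builtins, "max"))    # True: max is built in
print(hasattr(builtins, "min"))    # True: min is built in
print(hasattr(builtins, "mean"))   # False: there is no built-in mean
```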

In order to calculate the mean of heights, then, we’ll need to do something different. We could either do it by hand (e.g., write code that adds up all the elements in heights and divides by the number of elements) or we could import a package or library (we’ll use those words largely interchangeably in this book, much to the dismay of purists) that contains its own specialized functions. We’ll see lots of examples of this very soon, but a preview is something like:

import statistics              # import a package called statistics
statistics.mean(heights)       # use a function in the statistics package called mean to calculate the mean
63
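For comparison, the by-hand route mentioned above can be sketched with two built-in functions, sum and len. The names total, count, and mean_height are made up for this illustration.

```python
heights = (60, 64, 58, 70)

total = sum(heights)           # add up all the elements: 252
count = len(heights)           # count the elements: 4
mean_height = total / count    # divide the total by the count
print(mean_height)             # 63.0
```

The by-hand version shows 63.0 rather than 63 because / always produces a decimal result; the two answers are the same number.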

We talked about packages earlier in this chapter when we discussed some of the features of Python. Indeed, the statistics package is one of many packages we’ll be using this term. Basically, you can think of packages as add-ons to Python that provide lots of other functions specific to whatever it is we’re doing. We’ll use others for dataset manipulation, more advanced mathematical calculations, data visualization, machine learning, and more. Because it’s often helpful for intuition and coding practice, we’ll also show you how to do many of these by hand as well.