3.6. Tables

The last thing we will do in this chapter is construct tables. Tables are the fundamental way we organize data in data science, and are central to our work going forward.

Tables have two components: rows and columns. It’s crucial that you remember what each contains and represents (and, if you ever encounter a table that doesn’t follow this format, it’s very likely it was created by someone inexperienced in data science!):

  • Rows = observations

  • Columns = variables (sometimes called features, labels, and more depending on our type of analysis)

An observation is an instance of the thing we are studying. We will frequently refer to this as the unit of analysis. If we are working with a dataset about students in a class, each observation will likely be a student. If we’re studying data about, say, countries’ GDP, each country is likely to be an observation (we say “likely” because in principle the data could be structured differently, but this is the most common setup). Data about rainfall in one year might be recorded as inches of rainfall per month, so each row, or observation, would be one month of the year.

In chapter 2, John Snow was evaluating people to understand who was infected with cholera, so we could imagine his dataset being set up such that each observation was a person. That said, we also saw a table containing the results of his natural experiment, in which each row (observation) was a water supply company. Ultimately, it’s up to us as researchers how we wish to structure our data, and the choice often depends on the nature of our research question and the types of analyses we think will help us explore it.

A variable is an attribute or piece of information about each observation. We call it a variable because it can take on varying values. In John Snow’s table, some variables might have been a person’s infection status, information about their proximity to the Broad Street Pump, and perhaps other information of interest, such as their temperature, or anything else he deemed relevant.

We did not see a specific table of the people he treated or included in his study, but we could imagine one, such as the very historically accurate one below:

name               infection_status   pump_proximity   temp
Tyrion Lannister   1                  45               100
Jorah Mormont      1                  60               102
Samwell Tarley     0                  210              98
Arya Stark         0                  180              99

Notice that the values under infection_status are either 0 or 1. We call these dummy variables, and they’re very useful in data science. We use them whenever a variable can take on a yes-or-no value (for example, a disease is present or it isn’t, a war happened or it didn’t, or a student had perfect attendance or not). A 1 indicates yes (that thing happened, exists, or is present), and a 0 indicates no.
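As a minimal sketch of how a dummy variable could be built in plain Python, the example below recodes a list of yes/no answers into 1s and 0s. The answers and the names `infected` and `infection_status` are made up for illustration; they are not from John Snow’s data.

```python
# Hypothetical yes/no answers, one per person (made up for illustration).
infected = ['yes', 'yes', 'no', 'no']

# Recode each answer as a dummy: 1 for 'yes', 0 for anything else.
infection_status = []
for answer in infected:
    if answer == 'yes':
        infection_status.append(1)
    else:
        infection_status.append(0)
print(infection_status)

# Because the values are 0s and 1s, adding them up counts the 'yes' cases,
# and dividing by the number of people gives the share of 'yes' cases.
print(sum(infection_status))
print(sum(infection_status) / len(infection_status))
```

One reason dummies are handy is visible in the last two lines: simple arithmetic on the column answers questions like “how many?” and “what proportion?”.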

Finally, in many datasets, the first variable gives us lots of information about the unit of analysis (in the above example, we can infer that our unit of analysis is individual people), though we’ll see examples later in the book where we actually need to inspect more than one variable to identify the unit of analysis.

Constructing tables with dictionaries

For most of our work in this book we will use the pandas package to turn existing external datasets into tables, or what pandas refers to as dataframes. But, it’s worth knowing how to construct tables on our own, too. To do this, we will use a Python dictionary, which allows us to store data in key: value pairs.

mydictionary = {'key1': 'element1', 'key2': 'element2'}
mydictionary
{'key1': 'element1', 'key2': 'element2'}

To use a real example, we could create a dictionary that maps people’s names to their birthday months. Note the use of ' ' because our keys and elements are all strings.

birthday = {'Prof. JR': 'August', 'Erika': 'October', 'Taylor': 'December'}
print(birthday)
{'Prof. JR': 'August', 'Erika': 'October', 'Taylor': 'December'}

Because we have stored this data as a dictionary, we can access the elements by inputting the keys, such as:

birthday['Taylor']
'December'
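It may also help to see what happens with a key that is not in the dictionary. The sketch below uses the same birthday dictionary; the name 'Sam' and the month 'March' are hypothetical additions for illustration only.

```python
birthday = {'Prof. JR': 'August', 'Erika': 'October', 'Taylor': 'December'}

# Looking up a key that exists returns its element.
print(birthday['Erika'])

# Square brackets with a missing key raise a KeyError,
# so we can check whether the key is present first ...
print('Sam' in birthday)

# ... or use .get(), which returns a default value instead of an error.
print(birthday.get('Sam', 'unknown'))

# Assigning to a new key adds a new key: value pair to the dictionary.
birthday['Sam'] = 'March'
print(birthday['Sam'])
```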

We can use this exact same logic to build a table of data. Let’s start with one row, or observation:

mytable = [{'name': 'Prof. JR', 'birthday': 'August', 'zodiac': "Leo"}]
mytable
[{'name': 'Prof. JR', 'birthday': 'August', 'zodiac': 'Leo'}]

Now we add a second and third row to our table (notice the comma separating each row from the next!):

mytable = [{'name': 'Prof. JR', 'birthday': 'August', 'zodiac': "Leo"},
           {'name': 'Erika', 'birthday': 'October', 'zodiac': "Libra"},
           {'name': 'Taylor', 'birthday': 'December', 'zodiac': "Sagittarius"}]
mytable
[{'name': 'Prof. JR', 'birthday': 'August', 'zodiac': 'Leo'},
 {'name': 'Erika', 'birthday': 'October', 'zodiac': 'Libra'},
 {'name': 'Taylor', 'birthday': 'December', 'zodiac': 'Sagittarius'}]
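Since mytable is a list of dictionaries, we can combine list indexing and dictionary keys to pull out a row, a single value, or a whole column. The following is a small sketch; the names `second_row` and `zodiacs` are just illustrative choices.

```python
mytable = [{'name': 'Prof. JR', 'birthday': 'August', 'zodiac': 'Leo'},
           {'name': 'Erika', 'birthday': 'October', 'zodiac': 'Libra'},
           {'name': 'Taylor', 'birthday': 'December', 'zodiac': 'Sagittarius'}]

# Each item in the list is one row (observation); Python counts from 0,
# so index 1 is the second row.
second_row = mytable[1]
print(second_row)

# Within a row, each key is a variable.
print(second_row['zodiac'])

# To collect one variable (column) across all rows, loop over the rows.
zodiacs = []
for row in mytable:
    zodiacs.append(row['zodiac'])
print(zodiacs)
```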

Pandas DataFrames

The table above is not particularly easy to read or interpret, but we can fix that by turning mytable into a Pandas DataFrame with the following code:

import pandas as pd
df = pd.DataFrame(mytable)
df
       name  birthday       zodiac
0  Prof. JR    August          Leo
1     Erika   October        Libra
2    Taylor  December  Sagittarius

Ta-da! We’ve made something that looks like a table of data – because it is!
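The same approach should work for the imagined John Snow table from earlier in this section. The sketch below assumes pandas is installed; the names `snow_table` and `snow_df` are hypothetical, and the data are the made-up values from that table.

```python
import pandas as pd

# One dictionary per row (observation); each key is a variable.
snow_table = [
    {'name': 'Tyrion Lannister', 'infection_status': 1, 'pump_proximity': 45, 'temp': 100},
    {'name': 'Jorah Mormont', 'infection_status': 1, 'pump_proximity': 60, 'temp': 102},
    {'name': 'Samwell Tarley', 'infection_status': 0, 'pump_proximity': 210, 'temp': 98},
    {'name': 'Arya Stark', 'infection_status': 0, 'pump_proximity': 180, 'temp': 99},
]
snow_df = pd.DataFrame(snow_table)

# .shape reports (number of rows, number of columns).
print(snow_df.shape)

# .columns lists the variable names, taken from the dictionary keys.
print(list(snow_df.columns))

# Selecting a column by name works much like a dictionary lookup;
# summing the dummy variable counts the infected people.
print(snow_df['infection_status'].sum())
```

Here the rows are people and the columns are variables, matching the rows = observations, columns = variables layout described at the start of this section.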

Conclusion

We’ve thrown a lot at you in this chapter, especially if you’re brand new to programming. Truly, the best way to internalize and understand everything we’ve covered is to go through all the code in your own notebook, making modifications and testing things out (such as, what happens if I don’t include a comma above?), to get a feel for how it all works. The more comfortable you are with the techniques and concepts presented here, the more you can focus on the more complicated concepts related to statistics and machine learning in later chapters.

Most of all (and we mean this in the least patronizing way possible): have fun! Coding can be tedious and debugging can make all of us want to tear our hair out, but we also find it immensely gratifying, satisfying, and sometimes even meditative – as well as fun to build something that does exactly what you want. We hope you enjoy your newfound powers as much as we do!