3.4. Data types

Data science requires, of course, working with data. As we will see in this book, this data might come from importing an external data source created by a third party, manually entering data into tables we construct ourselves (which we’ll do at the end of this chapter), or simulating data given a set of constraints and assumptions. Regardless of the source of our data, all data has a type. A data type is:

a classification or categorization of a value that dictates the types of values it may take on, as well as the kinds of operations that may be performed on it.

Data types also have implications for how the information is stored in the computer. We won’t do anything in this book where the type of our data will make a big difference, but there may be reasons in future study or practice where you’ll want to change the type of your data in order to more efficiently store or process it.

Python has a number of built-in data types, as well as built-in functions to inspect and change those types. Some of the most commonly used are:

Numeric: Numbers! These can be integers (int), real numbers (float), or complex numbers (complex)
Sequence: An ordered collection of values, such as a word (string), which is made up of letters in a specific order, a list (an ordered collection of elements that we can edit), or a tuple (an ordered collection that is fixed)
Boolean: Values that can take on one of two values: True or False
Dictionary: A mapping from keys to elements, where each key is used to look up the element stored with it
Set: An unordered collection of unique values
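As a quick, informal tour of the types listed above, the sketch below builds one example value of each and asks Python what it is. The name examples and the particular values are ours, chosen only for illustration, and the loop is just a compact way of calling type() on each value in turn.

```python
# One illustrative value for each built-in type named in the list above.
examples = [
    6,                    # int
    6.2,                  # float
    2 + 3j,               # complex
    'data',               # str (string)
    [8, 9, 10],           # list
    (8, 9, 10),           # tuple
    True,                 # bool (Boolean)
    {'name': 'Susan'},    # dict (dictionary)
    {8, 9, 10},           # set
]

for value in examples:
    # type(value).__name__ is the short name of the type, such as 'int'
    print(type(value).__name__, value)
```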

In this book we are going to mostly use numeric, sequence, and Boolean types, and we’ll see dictionaries in action when we construct tables at the end of this chapter. We will not use sets much, but it’s worth knowing they exist. In what follows, we’ll discuss these types in more detail, as well as provide some helpful code (including built-in functions) to work with them.

As a preview: One of the ways you’ll likely frequently interact with data types as a data scientist in the real world is that you’ll have to change data types in your dataset. Often, they will be stored automatically one way (as, e.g., a string), but you need them to be something else (e.g., float) in order to conduct your analyses. We’ll give some examples of this below.

Numeric data

Not surprisingly, a lot of data is numeric. An int data type is an integer, meaning it’s a whole number with no fractional part. For example, 6 is an int. 6.2 is not. Note that Python does not treat 6.0 as an int either: although it is a whole number mathematically, the decimal point means it is stored as a float. Both 6.2 and 6.0 are examples of a float.

A float is a real number, or any number on a number line. It represents a value of a continuous quantity and can in principle have infinitely many decimal places (think, for example, of $\pi$, whose decimal expansion begins 3.14159… and never ends). In practice, Python stores a float with finite precision, so such values are approximations. Any value written with a decimal point is stored as a float.
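To see what finite storage means in practice, here is a small sketch using Python's standard math module (not otherwise used in this section). The variable name total is ours. It shows that the stored value of $\pi$ is a float, and that tiny rounding errors can appear in ordinary decimal arithmetic.

```python
import math

print(math.pi)                    # a finite approximation of pi, stored as a float
print(type(math.pi))              # <class 'float'>

total = 0.1 + 0.2
print(total == 0.3)               # False: the stored values are only approximations
print(math.isclose(total, 0.3))   # True: equal up to a tiny rounding error
```

This is rarely a problem for the analyses in this book, but it is why comparing two floats for exact equality can occasionally surprise us.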

A complex number is a real number and an imaginary number added together. You may recall seeing them written as $a+bi$, where $i$ is the imaginary unit (the square root of $-1$). We will only use complex numbers ever so briefly at the end of this book, so don’t worry about them too much for now.
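For the curious, a minimal sketch of a complex number in Python follows. One thing to know is that Python writes the imaginary unit as j rather than $i$. The variable name z is ours.

```python
z = 2 + 3j                 # the complex number 2 + 3i, written with j in Python
print(type(z))             # <class 'complex'>
print(z.real, z.imag)      # 2.0 3.0 (the real and imaginary parts, each a float)
print(1j * 1j)             # (-1+0j): squaring the imaginary unit gives -1
```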

Inspecting data types

We can inspect the type of our data by using the built-in function type():

type(6)               # find the type of the value 6

int

type(6.2)             # find the type of 6.2

float

type(6.0)             # find the type of 6.0

float

Changing data types

There are also built-in functions to change the type of our data. For example, int() will turn the value in the parentheses into an integer, and float() will turn it into a float:

float(6)              # change 6 to a float

6.0

int(7.4)              # change 7.4 to an int

7

int(7.8)              # change 7.8 to an int. Notice anything unusual?

7

Notice that int(7.8) returns 7, rather than rounding up to 8. The int function doesn’t round; it simply lops off everything after the decimal point.
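If we do want rounding, Python has other built-in tools. The sketch below compares int() with the built-in round() and with math.floor() from the standard math module. Neither round() nor math.floor() is otherwise used in this section, so treat this as a side note.

```python
import math

print(int(7.8))           # 7: everything after the decimal point is dropped
print(int(-7.8))          # -7: dropping the decimals moves toward zero, not downward
print(round(7.8))         # 8: round() goes to the nearest whole number
print(math.floor(-7.8))   # -8: floor always moves down to the next whole number
```

The negative example is the one worth remembering: int() and math.floor() agree for positive numbers but not for negative ones.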

Storing our changes

The code above doesn’t permanently change any of the values in our program. For example, we haven’t now stored the value 6 as a float in our program henceforth. But there may be (many) occasions where we wish to do that. In this case, naming comes in very handy. See, for example:

my_new_float = float(6)            # change the value 6 into a float and store it as my_new_float
                                   # notice, as we learned in the naming section, this code doesn't return anything

print(my_new_float)                # print the value of my_new_float

6.0

We can check our work. First, let’s see that, indeed, we haven’t changed the actual value of 6 to a float in a manner that persists:

type(6)

int

But we can see that the variable my_new_float has been stored as a float:

type(my_new_float)

float

We can also change types of our named variables, such as:

int(my_new_float)                # return my_new_float converted to an int

6

Notice that this line returns 6 but, as before, does not change what is stored: my_new_float is still the float 6.0. If we did store the result under the same name (my_new_float = int(my_new_float)), we would have a classic source of bugs: a variable named my_new_float that actually holds an int. This is just one of many things to keep track of when coding that could trip us up. A better practice is to store the converted value under a new name, such as my_new_int, so we remember what each variable contains.
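A minimal sketch of that better practice, using the name my_new_int suggested above:

```python
my_new_float = float(6)                   # 6.0, stored as a float
my_new_int = int(my_new_float)            # store the converted value under an accurate name

print(my_new_float, type(my_new_float))   # 6.0 <class 'float'>
print(my_new_int, type(my_new_int))       # 6 <class 'int'>
```

Both variables now exist side by side, and each name tells us what it holds.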

Sequences

Sequences are incredibly important and useful in data science. We will use them a lot in this book, and you’ll continue to do so if you keep practicing data science. A sequence is

an ordered collection of elements.

We will frequently store data as sequences (though that’s not their only use). For example, we might want to keep track of a list of ages, heights, or temperatures of different people, horses, or places (hypothetically). Some of the most common sequences are as follows.

Strings

A string is a sequence of Unicode characters. It’s often a word, but it doesn’t have to be. We write strings surrounded by double quotes (" ") or single quotes (' '). You’ll see some special cases where we’ll indicate the presence of a string with triple quotes (''' '''), but those are generally used in only certain circumstances, such as strings that span several lines. Whether you prefer to write your strings as "my_string" or 'my_string' is mostly up to you (I personally find a single quote to be more elegant (#art!)), though if your string contains an apostrophe, the simplest option is to write it with double quotes. For example, "susan's string" works with double quotes; inside single quotes, the apostrophe would be read as the end of the string unless we escape it with a backslash.
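The quoting rules are easy to check for ourselves. In the sketch below, all variable names are ours and chosen only for illustration.

```python
single = 'my_string'
double = "my_string"
print(single == double)                    # True: the quote style is not part of the string

with_double_quotes = "susan's string"      # double quotes let the apostrophe through
with_escape = 'susan\'s string'            # a backslash escapes the apostrophe instead
print(with_double_quotes == with_escape)   # True: both spell the same string

multi_line = '''line one
line two'''                                # triple quotes allow a string to span lines
print(multi_line)
```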

We can use type() with strings, but we can only change some strings to numbers. For example:

type('Susan')              # inspect the type of the value 'Susan'

str

type('6')                   # inspect the type of '6'; note the quotation marks make it a string!

str

int('6')                    # change the string '6' into an int

6

If we were to write int('Susan'), however, we would get an error (a ValueError), as there’s no numeric value to which the computer can change 'Susan'. Notice also that if we do not include the quotation marks around 'Susan', we will also get an error (a NameError), because without the quotation marks the computer reads Susan as the name of a variable, and we have not defined one.
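We can watch the first of these errors happen without stopping our program. The sketch below uses try and except, a Python feature not covered in this section, purely so that the error message is printed rather than halting the code.

```python
print(int('6') + 1)        # 7: '6' converts cleanly, so arithmetic works on the result

try:
    int('Susan')           # there is no integer that 'Susan' could mean
except ValueError as error:
    print('Could not convert:', error)
```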

When we import data from the outside world, one relatively common source of trouble is that much data that we think is numeric (e.g., 16% or $2.75) will often be automatically encoded as strings due to the presence of non-numeric characters (% and $). You’ll get lots of practice with this as a data scientist, but the general idea here is you’ll need to write code to first strip out the non-numeric characters, then use our friendly built-in functions to change the data to numeric.
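Here is one minimal way that two-step cleanup might look for single values, using the string method .replace() to remove the unwanted character and float() to convert what remains. The variable names are ours, and real datasets usually need this applied to a whole column rather than one value at a time.

```python
raw_percent = '16%'
raw_price = '$2.75'

# Step 1: strip out the non-numeric character. Step 2: convert to a float.
percent = float(raw_percent.replace('%', ''))
price = float(raw_price.replace('$', ''))

print(percent, type(percent))   # 16.0 <class 'float'>
print(price, type(price))       # 2.75 <class 'float'>
```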

Lists

A list is an ordered sequence of elements. While strings are technically sequences, and we operate on them like other sequences (which we’ll see in the Programming 2 chapter), we will mostly interact with strings as individual elements. A list, however, is a way of storing multiple elements in a particular order. We write lists inside square brackets [ ] with the elements separated by commas. For example, below we simply write out a list, and then we name a list.

[8, 9, 10, 11]                         # create a list containing elements 8, 9, 10, 11 in that order

[8, 9, 10, 11]

my_list = [8, 9, 10, 11]               # create the same list as above, but this time name it my_list

We can check the type of a sequence just like we did with strings and numeric variables:

type(my_list)

list

A list also does not need to contain only numbers. For example, we could have:

my_diverse_list = [1, 'apple', 'nyc', 1984]

Unlike a string such as '6', a list can’t be converted to a single numeric data type with int() or float() (even if it contains only one numeric element, alas!).
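What we can do is convert the elements inside a list. The sketch below is one possible approach, with names of our own choosing. It uses try and except only to print the error instead of stopping, square brackets to pick out a single element by position, and a list comprehension (a compact loop we have not formally introduced) to convert every element.

```python
single_item_list = [6]

try:
    int(single_item_list)             # a list is not a number, even with one element
except TypeError as error:
    print('Could not convert:', error)

print(int(single_item_list[0]))       # 6: convert the element, not the list

string_numbers = ['8', '9', '10']
numbers = [int(item) for item in string_numbers]   # convert each element in turn
print(numbers)                        # [8, 9, 10]
```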

We can, however, manipulate the elements within a list: we can add new elements (.append()), remove them (using del), or isolate just one element (by putting its index (position) inside of []).

Importantly, the indexing (counting, basically) of the elements in a list begins at 0. So in our list my_list above, the number 8 is in position 0, the number 9 is in position 1, and so on. For example:

my_list.append(6)                 # add the element 6 to the end of my_list
my_list

[8, 9, 10, 11, 6]

del my_list[0]                    # delete the element in position 0 in my_list (so, delete the number 8)
my_list

[9, 10, 11, 6]

my_list[2]                        # show the element in position 2 in my_list (so, the number 11)

11

Note again that the order in which we run cells is extremely important. Because we deleted the element in position 0 before we asked for the element in position 2, every remaining element shifted one position to the left, so position 2 now holds 11 rather than 10. Had we not deleted the 0th item first, my_list[2] would have returned 10. (As always, try this out on your own to see for yourself and build some intuition.)
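To try this out in one place, the sketch below rebuilds my_list from scratch and looks at position 2 both before and after the deletion.

```python
my_list = [8, 9, 10, 11]
my_list.append(6)          # my_list is now [8, 9, 10, 11, 6]

print(my_list[2])          # 10: nothing has been deleted yet

del my_list[0]             # remove the 8; every later element moves one position left
print(my_list)             # [9, 10, 11, 6]
print(my_list[2])          # 11: position 2 now holds what used to be in position 3
```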

You may also be wondering if we can carry out mathematical operations on our lists. For example, could we multiply my_list $* 2$? What might that return? As always, we can find out by experimenting:

my_list * 2

[9, 10, 11, 6, 9, 10, 11, 6]

The result in this case might come as a surprise: rather than multiplying each element inside the list by 2, Python repeated the entire list twice. While lists are very useful, and they’re quick to reach for because they’re built into Python, they have limited functionality when it comes to mathematical operations. To do a bit more with sequences, we’ll frequently use arrays. Specifically, we’ll make use of numpy arrays, which require the NumPy package, or library, which we’ll install (and discuss) in the next section.
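In the meantime, plain Python can still double each element; it just takes a little more typing. The sketch below contrasts the two behaviors using a list comprehension, which is our stand-in here and not the array approach this book goes on to use. The names repeated and doubled are ours.

```python
my_list = [9, 10, 11, 6]

repeated = my_list * 2                           # repeats the whole list
doubled = [element * 2 for element in my_list]   # multiplies each element by 2

print(repeated)    # [9, 10, 11, 6, 9, 10, 11, 6]
print(doubled)     # [18, 20, 22, 12]
```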

Data Science for Everyone
