6.5. Variable types

Variable types are one more key thing we’ll want to be thinking about when we evaluate our data and as we go into more sophisticated analyses in the second half of the book. We previously saw data types, which are how data is stored in the computer – perhaps as an int, a float, a bool, and so on. A variable type, on the other hand, refers to the characteristics of the actual thing we are studying in the world. There are a number of ways to think about variable types, but we’ll focus on three: continuous, discrete, and categorical.

When we are working with data, a variable is a placeholder for an unknown quantity. In our horses data, we had variables for things like price, height, weight, color, and so on. These variable names tell us the sort of unknown quantity that might occupy that column – so, we can imagine that price data will include a number of some kind, whereas color data probably contains a category reflecting the color of the horse (e.g., bay, chestnut, etc.). Thus, a variable type is the kind of variable we are working with – is it a number, or a category?

The data type, then, refers to how that variable is stored in the computer. Generally, we might expect, or at least hope, that our numeric data will be stored as a float or an int. If it’s not, we’ll have to change types manually before we can perform mathematical operations on it. This happens, for example, when a value is imported with a $ or a %, either of which will cause Python to import that variable as a string. If we have categorical data, we can anticipate it will be imported as a string, or object. And, again, if it isn’t, we may need to adjust things.
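As a minimal sketch of that kind of manual type change, here is one way it might look in pandas. The data below is made up for illustration (it is not the horses dataset), and the column name `price_clean` is our own invention.

```python
import pandas as pd

# Hypothetical data: prices arrive with "$" and "," so they are read as text.
horses = pd.DataFrame({
    "price": ["$1,200", "$850", "$2,400"],
    "color": ["bay", "chestnut", "bay"],
})

# Ask pandas whether the column is numeric. Here it is not.
price_is_numeric = pd.api.types.is_numeric_dtype(horses["price"])
print(price_is_numeric)

# Strip the "$" and "," characters, then convert the text to floats.
horses["price_clean"] = (
    horses["price"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)

# Now mathematical operations such as the mean are available.
print(horses["price_clean"].mean())
```

Keeping the cleaned values in a new column, rather than overwriting `price`, makes it easy to compare against the original text if something looks off.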

Just as data types have an influence on what we can do with our data computationally, variable types influence the kind of operations and research we can perform as well. Sometimes this is directly related to their data type (we can’t find the mean of a categorical variable that’s stored as a string), and sometimes it’s more conceptual. Even if we could assign a number to each color of a horse (e.g., bay = 0, chestnut = 1, and so on), it’s not obvious that doing so would make sense from an inferential perspective. We could argue that it might mean something if we’re organizing the colors from lighter to darker, or solid to spotted, or something, but this is a stretch, and it’s still not really clear what the mean of this would tell us, even though we can technically calculate it mathematically.
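To see why a mean of color codes is shaky, here is a small sketch with made-up colors and two equally arbitrary codings (both dictionaries are hypothetical, chosen only for illustration). The "average color" changes when the coding changes, while the most common category does not.

```python
import pandas as pd

# Hypothetical horse colors.
colors = pd.Series(["bay", "chestnut", "bay", "gray"])

# Two arbitrary ways of assigning numbers to the same categories.
coding_a = {"bay": 0, "chestnut": 1, "gray": 2}
coding_b = {"gray": 0, "bay": 1, "chestnut": 2}

# The mean depends entirely on which coding we happened to pick.
mean_a = colors.map(coding_a).mean()
mean_b = colors.map(coding_b).mean()
print(mean_a, mean_b)

# A summary that respects the categories: the most common color.
most_common = colors.mode()[0]
print(most_common)
```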

We will use variable types a lot going forward, as they will have real implications for the types of analyses we conduct, including the types of machine learning algorithms we can use and learn from.

Continuous numeric variables

We will work with two main types of numeric variables: continuous and discrete. Continuous variables are variables that, in principle, could take on any numeric value within some range. Things like height, weight, percentages, and temperatures could all in principle take on any value. For example, a person could be 6.0 feet tall, but we could also be more precise and say they are 6.1, 6.11, 6.111111, or 6.111111111111111 (etc.) feet tall.

Discrete numeric variables

Discrete numeric variables can take on only certain separate values, typically whole numbers. The number of people in a household, the number of cars on a highway, or the grade level of a person in school can each take on only some values. You might have 1, 2, 3, 4, or more people in a household, but you will never have 2.11111111.

The distinction between continuous and discrete variables might seem conceptually simple – and it is, really, on some level – as the former can take on any value within a range and the latter can take on only certain separate values. In practice, however, the distinction can sometimes get blurry. For example, we could in principle measure someone’s shoe size or GPA as taking on any conceivable value between the smallest and biggest ever recorded feet and GPAs, but we almost never measure them as such. I at least have not ever seen a shoe size of 6.334912432 or a GPA of \(\pi\) (as much as I’d kind of love to). In these cases, whether we think of our variables as continuous or discrete depends on the particular data we have in front of us, as well as what we hope to do with that data, and how we see the world.

Really, whereas data types are something we can literally ask the computer to tell us, variable types can end up being a somewhat philosophical matter. Time, for example, is technically continuous, but we measure it in discrete ways, like by recording the year, or our age. (For the record, this is our most philosophically exhausting chapter, at least we hope!)
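Here is a small sketch of that gap between data type and variable type, using made-up numbers and hypothetical column names. Both columns are stored as floats, so the computer reports the same data type for each, even though one is a count and the other is a measurement.

```python
import pandas as pd

# Hypothetical data: a count that happens to be stored as floats,
# next to a genuinely continuous measurement.
df = pd.DataFrame({
    "household_size": [1.0, 4.0, 2.0, 3.0],
    "height_ft": [6.0, 5.4, 6.11, 5.75],
})

# The data type is something we can simply ask for.
print(df.dtypes)

# The variable type is not. Checking for whole numbers gives a hint,
# but it cannot settle the question on its own.
household_all_whole = bool((df["household_size"] % 1 == 0).all())
height_all_whole = bool((df["height_ft"] % 1 == 0).all())
print(household_all_whole, height_all_whole)
```

Note that a column of whole numbers is only weak evidence: heights rounded to the nearest foot would pass the same check, which is why the final call still rests on what we know about the thing being measured.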

Categorical variables

Categorical variables are variables that represent non-numeric categories, like color, religion, zodiac sign, or classes that you’re taking. Purists will tell you that there are two kinds of categorical variables, ordinal and nominal. While the distinction can matter, in this book it won’t really, so we’ll generally simply say “categorical” rather than make a further distinction.

But so that we have it: an ordinal categorical variable is one that is not numeric, but where the order of the categories means something. A common example of this is a “Likert scale” in survey questions, which is when you give respondents an option of indicating their agreement with a statement as something along the lines of “strongly agree, agree, neutral, disagree, strongly disagree.” Here, we’re ultimately measuring and recording categories, but they do go in a particular order from high agreement to low. Thus, in principle, we could recode these answers as, say, “5, 4, 3, 2, 1” and perform mathematical operations on them, even though they are not inherently numeric. Unlike our above example where we discussed finding the mean of a color (which hurts my head to think about), the mean might even, well, mean (ha) something.

That said – we advise proceeding with caution. Whenever you are working with ordinal data that you’ve turned into numbers, it’s worth being transparent, in every result and inference, that this transformation has taken place, and reminding your audience that you aren’t really working with numeric data. (You might be surprised how easy it is to forget this, and how often I’ve had to remind smart professionals that their people are not “fives” or “fours”.)
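A minimal sketch of recoding Likert answers in pandas might look like the following. The responses are invented, and the 1 to 5 scoring is just the conventional choice described above, not the only defensible one. Keeping the original labels next to the scores is one simple way to stay transparent about the transformation.

```python
import pandas as pd

# Hypothetical survey responses.
responses = pd.Series(
    ["agree", "strongly agree", "neutral", "agree", "disagree"]
)

# Spell out the order explicitly, from lowest to highest agreement.
levels = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]
likert = pd.Categorical(responses, categories=levels, ordered=True)

# Category codes run 0-4 in the order given above; add 1 to get 1-5.
scores = pd.Series(likert.codes + 1)

# Keep labels and scores side by side so the recoding stays visible.
recoded = pd.DataFrame({"response": responses, "score": scores})
print(recoded)

# The mean of the recoded scores.
print(scores.mean())
```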

As with the grey zone between discrete and continuous numeric variables in some cases, there can also be some confusion between discrete numeric and ordinal variables. If we are turning, e.g., Likert scale answers into numbers, aren’t we really now working with discrete numeric variables? Again, here we advise caution. Generally, variable type refers to the underlying phenomenon you are studying. While it won’t likely ruin your study if you think of your data as discrete numeric, it is dishonest when you present your results if you fail to mention that what you’re really measuring is a category – because what you really asked was a categorical question.

On the other hand, sometimes you’re working with data where you’ve specifically asked respondents to provide a number. One example is the American National Election Studies’ feeling thermometer, which asks Americans to give a number that reflects their favorability towards political actors, such as the president, Congress, and different political parties, where 0 is very unfavorable (cold) and 100 is very favorable (hot). In cases like this, you could (and people do) make a compelling case that you’re working with discrete numeric variables. Ultimately, as with so much of data science, it depends on your data, the underlying phenomenon, and your goals.

Finally, a nominal categorical variable is one where the order doesn’t matter. Things like brand of shoe, type of cereal, or favorite animal do not come with a particular order. You could impose one (smallest to largest, say), but inherently these are not things where the order of categories comes with any meaning.
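For nominal data, the summaries that make sense are counts and the most common category, not means. A short sketch with made-up cereal types, assuming we store the column using pandas’ `category` dtype:

```python
import pandas as pd

# Hypothetical nominal data stored as an unordered categorical.
cereal = pd.Series(["oat", "corn", "oat", "bran"], dtype="category")

# pandas records that these categories carry no order.
is_ordered = cereal.cat.ordered
print(is_ordered)

# Counting how often each category appears is always meaningful.
counts = cereal.value_counts()
print(counts)

# So is asking for the most common category.
most_common_cereal = cereal.mode()[0]
print(most_common_cereal)
```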

Conclusion

I don’t know about you, but this chapter required a lot of philosophical thinking, which is fun, but also exhausting. All of the concepts in this chapter will come up again and again in this book as we introduce new datasets, as well as in your life as you encounter new data, whether in your own work or when evaluating that of others. They really are the backbone of thinking like a data scientist, and even if you never do a single statistical analysis ever in your life after this course (though we hope that’s not the case!), being empowered with these ideas will go a long way towards helping you be a more sophisticated producer and consumer of data and data science.