10.2. Classification examples#

Example 1: credit card fraud#

Banks and other financial institutions monitor the use (and abuse) of credit cards. They want to know if a particular transaction is

  • “usual” (\(Y=0\))

  • “fraudulent” (\(Y=1\))

Here, the type of transaction is the outcome or dependent variable (\(Y\)), and it takes one of two values: \(0\) or \(1\). These values are not meaningful as numbers; they are labels for discrete categories. The categories are mutually exclusive, meaning that a transaction cannot be in both categories, and they are jointly exhaustive, meaning that there are no other categories.

The value – the category – of \(Y\) is predicted by a set of \(X\) variables. These variables come from the credit card owner’s personal purchase history (e.g. the locations where they usually buy things, the types of items they regularly purchase etc). There may also be \(X\) variables that come from other credit card holders’ purchase histories. Indeed, other customers will be a key source of information about fraudulent transactions: a given customer may never have reported a previous transaction as fraudulent, but other customers will have done so. So, for example, transactions that are very large relative to a customer’s usual amounts (say \(\$2000\) in a single purchase, where the average is more like \(\$6\)), and for items that are known to be commonly purchased with stolen credit cards (like bottles of hard liquor, which can be relatively easily resold for cash), will be flagged as fraudulent.

Ultimately, the goal is to use the past relationship between \(X\) and \(Y\) to predict the future class of transactions. Of course, it may be somewhat helpful to know that a previous purchase should not be billed to the customer; but the immediate goal is to decline fraudulent purchases as they happen—and potentially act on that information by e.g. alerting law enforcement. With this in mind, ambiguous cases, i.e. cases where we cannot easily assign the correct category, take on special importance. Suppose, for example, the purchase was of alcohol, but not very expensive alcohol, yet far from the bank customer’s home. Should that be classified as legitimate or fraudulent? Making a mistake either way is obviously costly: someone gets away with stealing from the bank, or we end up calling the police on our customer. Such boundary cases are very difficult.
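To make the setup concrete, here is a minimal sketch (not any bank's actual system) of learning the relationship between \(X\) and \(Y\) from past transactions. The data and the two features – purchase amount and distance from the customer's home – are entirely hypothetical:

```python
# A minimal sketch: classify transactions as usual (Y=0) or fraudulent (Y=1)
# from two hypothetical features. The data below is invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: [amount in $, distance from home in km]
X_train = np.array([
    [6, 1], [8, 2], [5, 1], [12, 3],        # usual purchases (Y=0)
    [2000, 150], [1500, 300], [900, 80],    # reported fraud (Y=1)
])
y_train = np.array([0, 0, 0, 0, 1, 1, 1])

# Fit a classifier on the past relationship between X and Y
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# An ambiguous boundary case: a cheap purchase, but far from home
print(clf.predict([[30, 120]]))
```

Note that the prediction for the boundary case depends entirely on what the model learned from the (tiny, invented) training set – exactly the kind of case the text flags as difficult.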

Example 2: dating websites#

Many websites attempt to “recommend” people or services or goods to other people. In a sense, matchmaking sites are not that different. We want our customers to find a match (\(Y=1\)), rather than a miss (\(Y=0\)). On a dating website, we will typically have a lot of \(X\) variables to use for that assessment: age, preferences, sexual orientation, location and so on. The other thing we have for (at least some) users is past successful and failed matches: we might know when they contacted a potential date (and when they didn’t), or indeed when customers like them (similar \(X\)s) stopped looking altogether – perhaps because they are now in relationships.

But as with the credit card fraud example, mistakes are far from costless. Customers of a dating site presumably get bored of the service if they don’t receive good matches. And it’s worth thinking about whether we would generally prefer to match them with “too many” people, such that many are ultimately unsuitable, or “too few”, such that the people with whom they do match are likely to be a better fit, but come along too rarely to sustain a reasonable dating life. In this sense, the problem is not that different from recommending movies or books.
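The “too many” versus “too few” trade-off can be sketched as a choice of decision threshold. Suppose the model assigns each candidate a (hypothetical) match probability; lowering the threshold recommends more candidates, raising it recommends fewer:

```python
# Sketch: how a probability threshold trades off "too many" vs "too few"
# recommendations. The scores below are invented match probabilities.
scores = [0.15, 0.40, 0.55, 0.70, 0.90]

loose = [s >= 0.3 for s in scores]   # low threshold: many recommendations
strict = [s >= 0.8 for s in scores]  # high threshold: few, better-fitting ones

print(sum(loose), sum(strict))  # → 4 1
```

Where exactly to set that threshold is a business decision as much as a statistical one, since the two kinds of mistake have different costs.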

Classification and Prediction#

As should be clear from these examples, prediction in such problems is about classification. We want to use the relationships between the variables in our data to predict the likely class outcome (category) of a new observation:

  • Should we regard this transaction as fraudulent (or not)?

  • Is this person a good match (or not)?

  • Will this movie appeal to the user (or not)?

The general structure of the problem is that we will have a past relationship between \(X\) and \(Y\). In “deployment” – that is, what we want to use the model for – we will observe the \(X\)s for some new observation (not in our current data set) and try to predict its class. A common tool for this procedure is machine learning. For us, this term will describe

a set of techniques that use data in some “automated” way to improve performance on some task.

We say “automated” to convey the idea that the computer will not need explicit instructions about how to combine inputs, but will learn this relationship by observing previous decisions. We will impose certain types of rules and constraints on the problem at hand, but there is a notion that it is a somewhat “hands off” process relative to other statistical approaches.

In what follows, we will specifically look at supervised machine learning, which means that the model has access to both explicit inputs and explicit outputs, which are typically coded by us in the first place. For example, we have decided that a particular type of transaction is a case of a “fraudulent” one, and we pass that label in to the learning process for the computer.

We will get to the building blocks momentarily, but just be aware that while this is an exciting and fast-moving area – or perhaps precisely because of this – there is a lot of unhelpful “hype” around machine learning. In fact, machine learning can be done with very simple techniques, including the linear model, as we shall see.

Building blocks of (supervised) machine learning#

Machine learning uses some new concepts and terminology, and it is helpful to become familiar with them:

  • The units (the transaction, the website customer etc) about which we wish to make predictions are the observations or instances.

  • Those observations have attributes, sometimes called features or covariates. These are the \(X\) variables that will help us predict \(Y\).

  • The \(Y\) are classes. For a given instance, we want to predict the class it should be in (using its features). This could be binary (e.g., “match”/”no match”) or something more elaborate (e.g. a predicted star rating that a consumer will give a product we advertise to them).

  • For obvious reasons, we sometimes call a machine learning model of this kind a classifier.
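The terminology above can be mapped onto a toy data set. Here each row is an instance, each of `age`, `distance_km` and `shared_interests` is a feature, and `matched` is the class (all values invented, loosely following the dating-site example):

```python
# A toy illustration of the terminology: instances, features, and classes.
instances = [
    {"age": 29, "distance_km": 5,  "shared_interests": 4, "matched": 1},
    {"age": 41, "distance_km": 60, "shared_interests": 1, "matched": 0},
    {"age": 33, "distance_km": 12, "shared_interests": 3, "matched": 1},
]

# The X variables (features) the classifier will use...
features = [[row["age"], row["distance_km"], row["shared_interests"]]
            for row in instances]
# ...and the Y variable (class) it will try to predict
classes = [row["matched"] for row in instances]

print(features)
print(classes)
```

A classifier's job is then to learn a mapping from rows of `features` to entries of `classes`.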

We will have historical data on the relationship between \(X\) and \(Y\). We call this the training data or training set. It is:

a set of observations for which we already know \(X\) and \(Y\), from which we want the machine to learn the way that \(X\) and \(Y\) are related.

This will enable us to predict the appropriate category for new observations. All things equal, we prefer that our classifier gets more predictions correct, but a model does not need to be 100% accurate to be extremely helpful in practice. To assess the accuracy of our classifier, we will use test data or a test set. This is:

a set of observations for which we already know \(X\) and \(Y\) and for which we want the machine to produce predictions, such that we can assess how well it does on data it has never “seen before” (i.e. that was not in its training set).

The test set will help us mimic the process of “new” data arriving when we try to deploy our model.
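The whole workflow – split the data, train on one part, assess accuracy on the held-out part – can be sketched in a few lines. The data here is synthetic (the class depends on the features plus noise, so no classifier can be 100% accurate), and the linear model used is logistic regression:

```python
# A minimal sketch of the train/test workflow, using synthetic data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
# Synthetic features; the class depends on them plus noise, so the
# relationship is learnable but not perfectly predictable.
X = rng.normal(size=(200, 2))
y = ((X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=200)) > 0).astype(int)

# Hold out part of the data as a test set the model never "sees" in training
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train on the training set only...
clf = LogisticRegression().fit(X_train, y_train)

# ...then assess accuracy on the unseen test set
acc = accuracy_score(y_test, clf.predict(X_test))
print("test accuracy:", round(acc, 3))
```

Evaluating on held-out data mimics deployment: the test observations play the role of the "new" transactions or customers the model will face in practice.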