9.7. Linear regression: prediction and diagnostics#
Let’s refit our linear regression model to our data.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
states_data = pd.read_csv("data/states_data.csv")
def standard_units(any_numbers):
    "Convert any array of numbers to standard units."
    return (any_numbers - np.mean(any_numbers))/np.std(any_numbers)

def correlation(t, x, y):
    return np.mean(standard_units(t[x])*standard_units(t[y]))

def slope(t, label_x, label_y):
    r = correlation(t, label_x, label_y)
    return r*np.std(t[label_y])/np.std(t[label_x])

def intercept(t, label_x, label_y):
    return np.mean(t[label_y])-\
        slope(t, label_x, label_y)*np.mean(t[label_x])
model_slope = slope(states_data, 'cig_tax12', 'smokers12')
model_intercept = intercept(states_data, 'cig_tax12', 'smokers12')
print("intercept...")
print(model_intercept)
print("\n")
print("model slope...")
print(model_slope)
intercept...
23.388876340779785
model slope...
-1.5908929289147922
Simple Prediction#
Now that we have the slope and the intercept, we can provide simple predictions. In particular, if you give me an $x$ value (a cigarette tax per pack), I can give you a predicted $y$ value (a percentage of smokers) by plugging it into the fitted line:

$$\widehat{\text{smokers}} = 23.39 - 1.59 \times \text{tax}.$$

So, if the tax per pack is $5, the prediction is 23.39 − (1.59 × 5) = 15.44, or 15.44%. If the tax becomes a subsidy (i.e. it reduces the cost of a pack) of $10, the prediction is 23.39 + (1.59 × 10) = 39.29, or 39.29%. Notice that nothing in the model forces the predictions to be “sensible” (i.e. forces the percentage to be between 0 and 100). For example, if we raise the tax to $17 per packet, our prediction is that the percentage of smokers is 23.39 − (1.59 × 17) = −3.64, which implies the percentage of the state population that smokes is negative, which is impossible.
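These back-of-the-envelope calculations are easy to reproduce in code. Here is a minimal sketch (the helper name predict_smokers is ours, not part of the original analysis) that plugs tax values into the fitted equation using the model_slope and model_intercept computed above:

def predict_smokers(tax_per_pack):
    # predicted percentage of smokers = intercept + slope * tax
    return model_intercept + model_slope * tax_per_pack

print(predict_smokers(5))    # roughly 15.4
print(predict_smokers(-10))  # roughly 39.3 (a $10 subsidy)
print(predict_smokers(17))   # negative, i.e. an impossible percentage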
Just for completeness, here is a simple function that predicts a $y$ value for every $x$ value in a data frame (using the slope and intercept functions we already defined):
def fit(dframe, x, y):
    """Return the value of the regression line at each x value."""
    beta = slope(dframe, x, y)
    constant = intercept(dframe, x, y)
    return (beta*dframe[x]) + constant
Here we apply the function to our entire dataset just to get the prediction line (the linear regression line) for every observation:
regression_prediction = fit(states_data, 'cig_tax12','smokers12')
plt.scatter(states_data['cig_tax12'], states_data['smokers12'],
            edgecolors="black")
plt.plot(states_data['cig_tax12'], regression_prediction,
         color="black", linewidth=2)
plt.show()

Residuals#
For every actual observation in our data, we have a prediction for it from the regression model. This is what the line tells us, and you can think of it as what we would expect the state’s percentage of smokers to be, given its cigarette tax. To see how the actual values and the predictions can differ, consider the far right portion of our earlier regression line:

Fig. 9.8 The far right portion of our earlier regression line#
We can see that the actual observation for New York (NY) is above the regression line. Meanwhile, the actual observation for Hawaii (HI) is below the regression line. This tells us that, in some sense, New York’s tax ($4.35 per pack) is associated with a higher percentage of smokers than the model predicts for that tax, while Hawaii’s is associated with a lower percentage than the model predicts.
In fact, we can be more precise. The actual value for New York’s percentage of smokers is simply the value recorded in our data; the predicted value is the height of the regression line at New York’s tax. Here is a small function that returns that prediction for a given state:
def make_reg_prediction(state="NY"):
    # strip any whitespace from the state codes so we can match on them
    states_data['clean_stateid'] = states_data['stateid'].str.strip()
    sd = states_data[states_data['clean_stateid'] == state]
    pred = model_slope*sd['cig_tax12'] + model_intercept
    predf = float(pred.iloc[0])
    return predf
For New York, this is:
make_reg_prediction(state="NY")
16.46849210000044
This brings us to an important term. We can define a residual as being the difference between the actual value of the outcome and the value the model predicts for that observation:

$$\text{residual} = \text{actual } y - \text{predicted } y.$$

For New York, the residual is thus its actual percentage of smokers minus the predicted value of 16.47. We say this is a positive residual because it is greater than zero: New York sits above the line. Meanwhile, Hawaii had a negative residual, because the equivalent number for Hawaii (its actual value minus its predicted value) is less than zero.
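As a quick check, we can compute New York’s residual directly from the data. This is a minimal sketch (the names ny_actual and ny_residual are ours) that reuses the make_reg_prediction helper above and assumes the observed values sit in the smokers12 column:

ny_actual = states_data.loc[
    states_data['stateid'].str.strip() == "NY", 'smokers12'
].iloc[0]
ny_residual = ny_actual - make_reg_prediction(state="NY")
# a positive number: New York sits above the regression line
print(ny_residual)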
We will use the sum of the squared residuals momentarily, but for now notice that there are some important facts about residuals themselves (the first two are checked numerically in the short sketch after this list):

1. The residuals in a regression sum to zero. The intuitive reason is that, if they didn’t (that is, if the positive ones weren’t completely offset by the negative ones), we could alter the intercept and the slope a little to get the sum to zero, and that would suggest a better fitting line. Similarly, the residuals are on average zero.

2. The residuals are uncorrelated with our predictor variable (our $x$). Like fact 1, this is a consequence of how the least squares line is chosen, but we won’t explore exactly why in this course.

3. The residuals are informative about model fit. That is, they can tell us how well our straight line describes the data. We will return to this idea below.
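Here is that check: a minimal sketch (the name check_residuals is ours) that computes the residuals with the fit function from above and confirms facts 1 and 2, up to floating-point rounding:

check_residuals = states_data['smokers12'] - fit(states_data, 'cig_tax12', 'smokers12')

# fact 1: the residuals sum (and therefore average) to essentially zero
print(check_residuals.sum())

# fact 2: the residuals are essentially uncorrelated with the predictor
print(np.corrcoef(states_data['cig_tax12'], check_residuals)[0, 1])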
Residuals and the best fit line#
In what sense is our linear regression line the line that fits the data “best”? Loosely, it is the line that makes the residuals, taken together, as small as possible. Specifically, it is the ordinary least squares line: the line that
minimizes the sum of the squared residuals
For obvious reasons, the ordinary least squares line, and indeed the technique in general, is sometimes referred to by its initials, OLS.
Best fit line#
We are fitting via OLS, which minimizes the sum of squared residuals. Literally, it is finding the value of the intercept $a$ and the slope $b$ that make

$$\sum_{i=1}^{n} \big(y_i - (a + b x_i)\big)^2$$

as small as possible. The term $y_i - (a + b x_i)$ is another way of describing the residual, above: for a given row of the data, it is the gap between the actual outcome $y_i$ and the value the line predicts at $x_i$.

The idea is that the sum is built up row by row: the first squared residual, $(y_1 - (a + b x_1))^2$, is calculated, and this is added to the second squared residual, $(y_2 - (a + b x_2))^2$. And so on, down the data. Intuitively, it is as if we are trying out different values of $a$ and $b$ and keeping the pair that makes the total as small as possible.

Now, in fact, we don’t literally try each possible value of $a$ and $b$: the slope and intercept formulas we used above (built from the correlation and the standard deviations of $x$ and $y$) deliver the minimizing values directly.
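If you want to convince yourself of that, here is a rough numerical check, a sketch rather than part of the analysis proper: it assumes scipy is available and asks scipy.optimize.minimize to search over candidate values of $a$ and $b$ for our tax-and-smoking regression.

from scipy.optimize import minimize

def sum_sq_residuals(params):
    # sum of squared residuals for a candidate intercept a and slope b
    a, b = params
    predictions = a + b * states_data['cig_tax12']
    return np.sum((states_data['smokers12'] - predictions) ** 2)

# start the search from an arbitrary guess of a = 0, b = 0
best = minimize(sum_sq_residuals, x0=[0, 0])
print(best.x)  # should land very close to (model_intercept, model_slope)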
Residuals and diagnostics#
As the logic above suggests, it is straightforward to calculate the residual for any given observation. Ideally, we would like these residuals to be approximately the same size, no matter what the particular value of the predictor ($x$) happens to be. This property is called constant error variance, or homoscedasticity; when the spread of the residuals changes with $x$, we instead have heteroscedasticity.
Another way to put this is that we don’t want the line to fit systematically better for some values of $x$ than for others.
The consequences of heteroscedasticity are twofold from our perspective:
- First, it can generate problems with interpreting coefficients. In particular, it can lead us to make type I or type II errors when declaring that a specific coefficient (so the slope, or $\beta$, for a specific variable) is “statistically significant” or not.
- Second, it is emblematic of a specification error. That is, non-constant error variance potentially tells us that a linear regression is not appropriate, and perhaps we should have allowed for a non-linear model (line) through our data instead.
To get a sense of the diagnosis procedure, let’s start by writing a simple function to calculate the residuals for a given regression of $y$ on $x$ (it uses our fit() function above):
def residual(dframe, x, y):
    return dframe[y] - fit(dframe, x, y)
To get a sense of how constant error variance looks, we can generate some (fake) data which we know does not suffer from heteroscedasticity. We won’t belabor the details, but we draw a fake $x$ and a fake $y$ that are completely unrelated to each other, and then apply our residual function to the regression of the (fake) $y$ on the (fake) $x$:
np.random.seed(5)
fake_x = np.random.normal(3,1, 200)
fake_y = np.random.normal(2,2, 200)
fake_data = pd.DataFrame().assign(x=fake_x, y=fake_y)
perfect_residuals = residual(fake_data, 'x', 'y')
Now, we just want to plot these (fake) residuals against the value of the predictor variable on the horizontal axis:
plt.scatter(fake_data['x'], perfect_residuals,
            color="red", edgecolors="black")
plt.ylabel("Residuals")
plt.xlabel("x")
plt.xlim(0, None)
plt.axhline(0, color="blue")
plt.grid()
plt.show()

That is approximately what we see: as we move from low values of $x$ to high values of $x$, the spread of the residuals around the zero line stays roughly constant.
What about for our regression above?
model_residuals = residual(states_data, "cig_tax12", "smokers12")
plt.scatter(states_data['cig_tax12'],
            model_residuals, color="red", edgecolors="black")
plt.ylabel("Residuals")
plt.xlabel("Tax")
plt.xlim(0, None)
plt.axhline(0, color="blue")
plt.grid()
plt.show()

Here, the residuals for low values of the tax variable are more spread out than for high values. And, for high-tax states, the residuals are smaller. This suggests our model does a better job of fitting the observations for high-tax states. In general then, it seems we may have some non-constant error variance.
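One rough way to quantify that impression is to compare the spread of the residuals on either side of the median tax. This is just an illustrative sketch (the median split and the name high_tax are ours); if the line fitted equally well everywhere, the two numbers would be roughly the same.

high_tax = states_data['cig_tax12'] > states_data['cig_tax12'].median()

# spread of the residuals for low-tax and high-tax states
print(np.std(model_residuals[~high_tax]))
print(np.std(model_residuals[high_tax]))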
For an even sharper example, consider a regression of the percentage of a state’s population that is pro-choice (prochoice) on its proportion of Hispanic residents (hispanic08).
model2_residuals = residual(states_data, 'hispanic08', 'prochoice')
plt.scatter(states_data['hispanic08'],
            model2_residuals, color="red", edgecolors="black")
plt.ylabel("Residuals")
plt.xlabel("Hispanic Population")
plt.xlim(0, None)
plt.axhline(0, color="blue")
plt.grid()
plt.show()

We see something quite different here. For low Hispanic population values, the error variance looks somewhat constant. But for mid-levels, the residuals are mostly above the line (we underpredict for these states). For relatively large values of Hispanic population, the residuals are mostly below the line (we overpredict for these states). This suggests that a straight line is not an appropriate description of the relationship between these two variables.
Indeed, looking at the regression line for this example suggests it should perhaps be more of a curve:
regression_prediction2 = fit(states_data, 'hispanic08','prochoice')
plt.scatter(states_data['hispanic08'], states_data['prochoice'],
            edgecolors="black")
plt.plot(states_data['hispanic08'], regression_prediction2,
         color="black", linewidth=2)
plt.show()

Model fit#
How well does our regression fit our data? One common way to think about it is in terms of the amount of variation in the outcome ($y$) that the model accounts for.
The particular measure we will use is called R-squared or R-square or $R^2$. It is
the proportion of the variance in the outcome ($y$) that our linear regression explains; and it is the square of the correlation between $x$ and $y$.
Perhaps unsurprisingly, when the correlation between $x$ and $y$ is close to 1 or -1, $R^2$ is close to 1 and the line fits the data well; when the correlation is close to zero, so is $R^2$.
As with the slope and the intercept, there are many ways to calculate $R^2$. One is as the variance of the fitted values divided by the variance of the outcome:

$$R^2 = \frac{\operatorname{Var}(\hat{y})}{\operatorname{Var}(y)}.$$

From the formula, we can see that when the variance of the fitted values is the same as the variance of $y$, $R^2$ is 1 and the model explains all of the variance in the outcome.
The fact that $R^2$ is the square of the correlation between $x$ and $y$ gives us a simple way to compute it in code:
def r2(t, x, y):
    return correlation(t, x, y)**2
Applied to our model, we see that the proportion of variance explained is about 0.19 (that is, about 19% of the variance in the percentage of smokers is explained by the cigarette tax):
r2(states_data, "cig_tax12", "smokers12")
0.18619559936896055
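As a check on the variance-based formula above, we can also compute $R^2$ a second way: the variance of the fitted values (the regression_prediction series from earlier) divided by the variance of the outcome. This is a minimal sketch; the two numbers agree up to floating-point rounding.

# R-squared as Var(fitted values) / Var(outcome)
print(np.var(regression_prediction) / np.var(states_data['smokers12']))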