11 Residuals
When looking at a linear model of your data, there’s a measure you need to be aware of called residuals. A residual is the distance between what the model predicted and what the real outcome was. Take our model at the end of the correlation and regression chapter. Our model predicted that Nebraska, given a 5-yard net yardage margin, would beat Iowa by 1.96 points. They lost by 6. The residual is the actual outcome minus the prediction: -6 - 1.96 = -7.96.
Residuals can tell you several things, but the most important is whether a linear model is the right model for your data. If the residuals appear to be random, then a linear model is appropriate. If they have a pattern, it means something else is going on in your data and a linear model isn’t appropriate.
Residuals can also tell you who is underperforming and overperforming the model. And the more robust the model – the better your r-squared value is – the more meaningful that label of under or overperforming is.
Let’s go back to our net yards model.
For this walkthrough, you’ll need the football game logs data, data/footballlogs20.csv.

Then load the tidyverse.

library(tidyverse)
logs <- read_csv("data/footballlogs20.csv")
Rows: 1100 Columns: 54
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): HomeAway, Opponent, Result, TeamFull, TeamURL, Outcome, Team, Con...
dbl (45): Game, PassingCmp, PassingAtt, PassingPct, PassingYds, PassingTD, ...
date (1): Date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
First, let’s make the columns we’ll need.
residualmodel <- logs |>
  mutate(
    differential = TeamScore - OpponentScore,
    NetYards = OffensiveYards - DefYards
  )
Now let’s create our model.
fit <- lm(differential ~ NetYards, data = residualmodel)
summary(fit)
Call:
lm(formula = differential ~ NetYards, data = residualmodel)
Residuals:
Min 1Q Median 3Q Max
-49.479 -8.593 0.128 8.551 48.857
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.311030 0.385651 0.807 0.42
NetYards 0.101704 0.002293 44.345 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 12.78 on 1098 degrees of freedom
Multiple R-squared: 0.6417, Adjusted R-squared: 0.6414
F-statistic: 1967 on 1 and 1098 DF, p-value: < 2.2e-16
We’ve seen this output before, but let’s review it, because if you are using scatterplots to make a point, you should do this. First, note the Min and Max residuals at the top. A team has underperformed the model by nearly 50 points (!), and a team has overperformed it by nearly 49 points (!!). The median residual, where half are above and half are below, is just slightly above the fit line. Close to zero here is good.
Next: Look at the Adjusted R-squared value. What that says is that about 64 percent of the variation in a team’s scoring differential can be explained by their net yards.
Last: Look at the p-value. We are looking for a p-value smaller than .05. Below that, we can say that our correlation didn’t happen at random. And, in this case, it REALLY didn’t happen at random. But if you know a little bit about football, it doesn’t surprise you that the more you outgain your opponent, the more you win by. It’s an intuitive result.
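If you’d rather pull those numbers out as data instead of reading them off the console, the broom package can do it. This is just a sketch, and it assumes you have broom installed (it comes along with the tidyverse, though it isn’t loaded by library(tidyverse)):

# glance() returns a one-row tibble with r.squared, adj.r.squared, p.value and more
broom::glance(fit)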
What we want to do now is look at those residuals. We want to add them to our individual game records. We can do that by adding two new fields – predicted and residuals – to our dataframe like this:
residualmodel <- residualmodel |>
  mutate(predicted = predict(fit), residuals = residuals(fit))
Now we can sort our data by those residuals. Sorting in descending order gives us the games where teams overperformed the model. To make it easier to read, I’m going to use select to give us just the columns we need to see.
residualmodel |>
  arrange(desc(residuals)) |>
  select(Team, Opponent, Result, differential, NetYards, predicted, residuals)
# A tibble: 1,100 × 7
Team Opponent Result differential NetYards predicted residuals
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Arizona State Arizona W (70… 63 136 14.1 48.9
2 Wisconsin Wake Fore… W (42… 14 -252 -25.3 39.3
3 Kentucky Mississip… W (24… 22 -138 -13.7 35.7
4 Mississippi State Vanderbilt W (24… 7 -274 -27.6 34.6
5 Kansas State Kansas W (55… 41 61 6.51 34.5
6 Boise State Colorado … W (52… 31 -24 -2.13 33.1
7 Army Mercer W (49… 46 139 14.4 31.6
8 Notre Dame South Flo… W (52… 52 198 20.4 31.6
9 Texas Oklahoma … W (41… 7 -243 -24.4 31.4
10 BYU North Ala… W (66… 52 201 20.8 31.2
# ℹ 1,090 more rows
So looking at this table, what you see here are the teams who scored more than their net yards would indicate. One of them should jump off the page at you.
Remember Nebraska vs Rutgers? We won, and everyone was happy and relieved the season was over. We outgained Rutgers by 368 yards in that game and won by 7. Our model predicted Nebraska should have won that game by 37 points. We should have blown Rutgers out of their own barn. But Rutgers isn’t as hard done by as Arizona, which the model says should have lost by only about 14 points, but ended up losing by 63.
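If you want to find that game in the data yourself, a quick filter will do it. A sketch, assuming the teams appear as "Nebraska" and "Rutgers" in the Team and Opponent columns:

# pull the one game record; the exact team name spellings are an assumption about this data
residualmodel |>
  filter(Team == "Nebraska", Opponent == "Rutgers") |>
  select(Team, Opponent, Result, differential, NetYards, predicted, residuals)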
But, before we can bestow any validity on this model, we need to see if a linear model is appropriate at all. We’ve done some of that by looking at our p-values and R-squared values. But one more check is to look at the residuals themselves. We do that by plotting the residuals against the predictor. We’ll get into plotting soon, but for now just seeing it is enough.
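A sketch of the code behind that plot, using the predicted and residuals columns we just created (don’t worry about the ggplot details yet):

# residuals vs. predicted values: we're hoping to see a shapeless cloud of points
ggplot(residualmodel, aes(x = predicted, y = residuals)) +
  geom_point()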
The lack of a shape here – the seemingly random nature – is a good sign that a linear model works for our data. If there was a pattern, that would indicate something else was going on in our data and we needed a different model.
Another way to view your residuals is by connecting the predicted value with the actual value.
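A sketch of one way to do that: plot each game as a point, use geom_segment to draw a line from the actual differential to the predicted one, and let geom_smooth draw the fit line.

# segments connect each game's actual differential to what the model predicted
ggplot(residualmodel, aes(x = NetYards, y = differential)) +
  geom_point() +
  geom_segment(aes(xend = NetYards, yend = predicted), alpha = .2) +
  geom_smooth(method = "lm", se = FALSE)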
`geom_smooth()` using formula = 'y ~ x'
The blue line here separates underperformers from overperformers.
11.1 Penalties
Now let’s look at a case where this doesn’t work: penalties.
penalties <- logs |>
  mutate(
    differential = TeamScore - OpponentScore,
    TotalPenalties = Penalties + DefPenalties,
    TotalPenaltyYards = PenaltyYds + DefPenaltyYds
  )
pfit <- lm(differential ~ TotalPenaltyYards, data = penalties)
summary(pfit)
Call:
lm(formula = differential ~ TotalPenaltyYards, data = penalties)
Residuals:
Min 1Q Median 3Q Max
-67.007 -14.699 0.265 14.047 64.993
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.21614 1.79282 0.678 0.498
TotalPenaltyYards -0.00349 0.01556 -0.224 0.823
Residual standard error: 21.36 on 1098 degrees of freedom
Multiple R-squared: 4.58e-05, Adjusted R-squared: -0.0008649
F-statistic: 0.05029 on 1 and 1098 DF, p-value: 0.8226
So from top to bottom:
- Our min and max go from -67 to positive 65
- Our adjusted R-squared is … -0.0008649. Not much at all.
- Our p-value is … 0.8226, which is a lot more than .05.
So what we can say about this model is that it’s statistically insignificant and utterly meaningless. Normally, we’d stop right here – why bother going forward with a predictive model that isn’t predictive? But let’s do it anyway.
penalties$predicted <- predict(pfit)
penalties$residuals <- residuals(pfit)
penalties |>
  arrange(desc(residuals)) |>
  select(Team, Opponent, Result, TotalPenaltyYards, residuals)
# A tibble: 1,100 × 5
Team Opponent Result TotalPenaltyYards residuals
<chr> <chr> <chr> <dbl> <dbl>
1 Clemson Georgia Tech W (73-7) 60 65.0
2 Arizona State Arizona W (70-7) 119 62.2
3 Alabama Kentucky W (63-3) 99 59.1
4 Marshall Eastern Kentucky W (59-0) 55 58.0
5 Texas Texas-El Paso W (59-3) 85 55.1
6 Pitt Austin Peay W (55-0) 57 54.0
7 Oklahoma Kansas W (62-9) 100 52.1
8 Wake Forest Campbell W (66-14) 82 51.1
9 BYU North Alabama W (66-14) 74 51.0
10 Notre Dame South Florida W (52-0) 65 51.0
# ℹ 1,090 more rows
First, note that the biggest misses here are all blowout games, some of the worst games of the season, the worst being Clemson vs Georgia Tech. The model missed that differential by … 65 points. The margin of victory? 66 points. In other words, this model is terrible. But let’s look at it anyway.
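Here’s the residual plot for the penalties model, the same sketch as before:

# residuals vs. predicted values for the penalties model
ggplot(penalties, aes(x = predicted, y = residuals)) +
  geom_point()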
Well … it actually says that a linear model is appropriate. Which is an important lesson: just because your residual plot says a linear model works, that doesn’t mean your linear model is good. There are other measures for that, and you need to use them.
Here’s the segment plot of residuals – you’ll see some really long lines. That’s a bad sign. Another bad sign? A flat fit line. It means there’s no relationship between these two things. Which we already know.
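A sketch of that plot, following the same pattern as the net yards version:

# long segments mean big misses; a flat fit line means no relationship
ggplot(penalties, aes(x = TotalPenaltyYards, y = differential)) +
  geom_point() +
  geom_segment(aes(xend = TotalPenaltyYards, yend = predicted), alpha = .2) +
  geom_smooth(method = "lm", se = FALSE)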
`geom_smooth()` using formula = 'y ~ x'