9  Using linear regression to predict a number

9.1 The basics

Linear models are something you’ve understood since you took middle school math and learned the equation of a line. Remember y = mx + b? It’s back. And, despite what you complained about bitterly in middle school, it’s very, very useful.

What a linear model says, in words, is that we can predict y by multiplying our x value by some value – a coefficient – and offsetting it with b, which is really the y-intercept, but think of it as where the line starts. Or, expressed as y = mx + b: points = true_shooting_percentage * some coefficient + some starting point. Think of the starting point as what the score should be if true_shooting_percentage is zero. Should be zero, right? Intuitively, yes, but it won’t always work out so easily.
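In R, that’s nothing more than arithmetic. Here’s a toy version where the slope, intercept and x value are all made up purely for illustration:

m <- 2.5   # a made-up coefficient (the slope)
b <- 4     # a made-up starting point (the intercept)
x <- 30    # a made-up x value
y <- m * x + b
y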

What we’re trying to do here is predict how many fantasy points a player should score given their draft position (and later other stats). For this, we’ll look at wide receivers, and we’re going to build a model based on the past 10 draft classes.

9.2 Feature engineering

First we’ll need libraries. You might need to install corrr by running install.packages("corrr") in your console.

library(tidyverse)
library(tidymodels)
library(corrr)

And we’ll need some data. In this case, we’ve got draft data from cfbfastR and fantasy data from Pro-Football-Reference.

fantasy <- read_csv("https://mattwaite.github.io/sportsdatafiles/fantasyfootball20132022.csv")

wr <- read_csv("https://mattwaite.github.io/sportsdatafiles/wr20132022.csv")

wrdrafted <- wr |> 
  inner_join(fantasy, by=c("name"="Player", "year"="Season"))

That leaves us with a dataframe of 263 observations – drafted wide receivers with their fantasy stats attached to them.

Let’s thin the herd here a bit and just get our selected stats for modeling. We’re really just going to have a handful of things: name, year, their college team and the team that drafted them, their overall draft number, their pre-draft grade from ESPN and the number of fantasy points they scored in their first year in the league.

wrselected <- wrdrafted |>
  select(
    name,
    year,
    college_team,
    nfl_team,
    overall,
    pre_draft_grade,
    FantPt
  )

9.3 Setting up the modeling process

With most modeling tasks, we need to start by setting a random number seed ahead of randomly splitting our data into training and testing sets.

set.seed(1234)

Random numbers play a large role in a lot of data science algorithms, so setting one helps our reproducibility.
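If you’re curious what a seed actually does, here’s a quick aside with a throwaway seed. If you run it, re-run set.seed(1234) afterward so your split below matches the book’s.

set.seed(42)  # a throwaway seed, just for this demonstration
runif(3)      # three "random" numbers

set.seed(42)  # same seed again...
runif(3)      # ...the exact same three numbers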

After that, we split our data. There are a number of ways to do this – R has a bunch and you’ll find all kinds of examples online – but Tidymodels has made this easy.

player_split <- initial_split(wrselected, prop = .8)

player_train <- training(player_split)
player_test <- testing(player_split)

Let’s start with a simple linear regression with one variable. We’re just going to use the overall draft position to predict fantasy points. How well does the draft pick do that? Are top picks big point getters and low picks low point scorers? Is there a pattern?

A lot of what comes next is familiar to you. We’re going to make a model, make a fit, then add the results to our training data and see where that gets us. The fit is made up of FantPt and overall separated by a ~, which can be verbalized as “is approximately modeled by.” So FantPt is approximately modeled by overall.
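If the formula syntax is new, here’s a quick sketch using R’s built-in mtcars data – deliberately not our draft data, so the exercise below stays an exercise. Read mpg ~ wt as mpg is approximately modeled by wt.

example_model <- linear_reg() |>
  set_engine("lm")

example_fit <- example_model |>
  fit(mpg ~ wt, data = mtcars)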

Our metrics, though, will say new things.

9.3.1 Exercise 1: Your first fit

lm_model <- linear_reg() |>
    set_engine("lm")

fit_lm <- lm_model |>
  fit(?????? ~ ???????, data = player_train)

We can look now at the pieces of the equation of a line here.

tidy(fit_lm, conf.int = TRUE)
# A tibble: 2 × 7
  term        estimate std.error statistic  p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)   89.8      5.48       16.4  4.10e-38   79.0     101.   
2 overall       -0.365    0.0426     -8.57 3.58e-15   -0.449    -0.281

The two most important things to see here are the terms and the estimates. Start with overall. What that says is that for every pick in the draft, a player should score about a third of a fantasy point less than the previous pick. So the slope takes away about 0.37 points at the first pick, about 0.73 at the second, and so on. The later the pick, the lower the predicted total. Thus our slope.

HOWEVER, the intercept has something to say about this. What the intercept says is that players start with about 90 fantasy points. The slope then adjusts them downward with each pick.

Think again about y = mx + b. We have our terms here: y is fantasy points, m is about -0.37, x is the pick number and b is about 90. Let’s pretend for a minute that you were drafted with the 10th pick. That would mean, on the slope, you’re down about 3.7 points, and 90 - 3.7 is about 86. Our model would predict that the 10th pick of the draft, if they were a wide receiver, would score about 86 fantasy points in their rookie season.
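You don’t have to do that arithmetic yourself. Once the fit exists, predict will do it. Here’s a sketch with a hypothetical 10th overall pick – the unrounded coefficients land it right around 86.

predict(fit_lm, new_data = tibble(overall = 10))  # returns .pred, the predicted fantasy points for pick 10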

9.3.2 Exercise 2: What is the truth?

That sounds good, but how good is our model?

trainresults <- player_train |>
    bind_cols(predict(fit_lm, player_train))

metrics(trainresults, truth = ??????, estimate = .pred)
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      41.8  
2 rsq     standard       0.280
3 mae     standard      32.9  

Our first step in evaluating a linear model is to get the r-squared value. The yardstick library (part of Tidymodels) does this nicely. We tell it to produce metrics on a dataset, and we have to tell it what the real world result is (the truth column) and what the estimate column is (.pred).

We have two numbers we’re going to focus on – rsq or r squared, and rmse or root mean squared error. R squared is the share of the variation in fantasy points that changes in overall can explain. You can read it as a percentage. So changes in overall draft position account for about 28 percent of the variation in fantasy points. Not great, but we’re just starting.

The rmse is how far off your predictions are on average. In this case, our fantasy point predictions are off by about 42 points (plus or minus) on average. Given that the intercept starts players at about 90, that’s also not great.
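If rmse and mae feel abstract, you can compute them yourself from the residuals. This is just a sanity check on what yardstick already told us, not a replacement for it.

trainresults |>
  summarize(
    rmse_by_hand = sqrt(mean((FantPt - .pred)^2)),  # root of the mean squared residual
    mae_by_hand = mean(abs(FantPt - .pred))         # mean absolute residual
  )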

9.3.3 Exercise 3: How does it fare?

We need to make those numbers better – rmse smaller, r squared bigger. But first, we should see how the model does with test data.

testresults <- player_???? |>
    bind_cols(predict(fit_lm, player_????))

metrics(testresults, truth = FantPt, estimate = .pred)
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      41.4  
2 rsq     standard       0.291
3 mae     standard      31.4  

Our r squared is up a bit, and our rmse actually came down a touch. So things didn’t change all that much, which is good. That means our model is robust to new data.

9.4 Multiple regression

The problem with simple regressions? They’re simple. Anyone who has watched a sport knows there’s a lot more to the outcome than just one number.

Enter the multiple regression.

Multiple regressions are a step toward reality – where more than one thing influences the outcome. However, the more variance we attempt to explain, the more error and uncertainty we introduce into our model.

To add a variable to your regression model to make a multiple regression model, you simply use + and add it in. Let’s add pre_draft_grade.
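Generically, the formula just grows by one term. Here’s a sketch with mtcars again, so we don’t give away the blank below: mpg modeled by wt plus hp.

example_fit <- linear_reg() |>
  set_engine("lm") |>
  fit(mpg ~ wt + hp, data = mtcars)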

9.4.1 Exercise 4: Adding another variable

lm_model <- linear_reg() |>
    set_engine("lm")

fit_lm <- lm_model |>
  fit(FantPt ~ overall + ???_?????_?????, data = player_train)

Let’s look at the pieces of the equation of a line again.

tidy(fit_lm, conf.int = TRUE)
# A tibble: 3 × 7
  term            estimate std.error statistic   p.value conf.low conf.high
  <chr>              <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
1 (Intercept)       56.8     27.4         2.07 0.0398       2.68    111.   
2 overall           -0.302    0.0756     -3.99 0.0000953   -0.451    -0.153
3 pre_draft_grade    0.369    0.294       1.25 0.212       -0.212     0.950

So our intercept is about 33 points lower. Our overall estimate is about the same – each pick lowers the fantasy point expectation – but pre_draft_grade now adds about four tenths of a point for each point of pre-draft grade. Note the p-value on pre_draft_grade, though: at 0.212, that term isn’t statistically significant, which is a warning sign we’ll come back to.
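As before, predict turns the estimates into a number. Here’s a sketch with a hypothetical 10th overall pick carrying an invented pre-draft grade of 90 – both values are made up for illustration.

predict(fit_lm, new_data = tibble(overall = 10, pre_draft_grade = 90))  # roughly 56.8 - (0.302 * 10) + (0.369 * 90), or about 87 points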

How does that impact our draft model?

trainresults <- player_train |>
    bind_cols(predict(fit_lm, player_train))

metrics(trainresults, truth = FantPt, estimate = .pred)
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      41.9  
2 rsq     standard       0.287
3 mae     standard      32.7  

Huh. Our r squared is almost unchanged and our rmse is up. What gives?

There are multiple ways to find the right combination of inputs to your models. With multiple regression, the most common is the correlation matrix. We’re looking to maximize r squared by choosing inputs that are highly correlated with our target value, but not with each other. Example: we can assume that overall draft pick and pre-draft grade are both highly correlated with fantasy points, but the problem comes if they are also highly correlated with each other. If so, we’re just adding error and not getting any new predictive value.

Using corrr, we can create a correlation matrix in a dataframe to find columns that are highly correlated with our target – FantPt. To do this, we need to select the columns we’re working with – overall and pre_draft_grade.

wrselected |> 
  select(FantPt, overall, pre_draft_grade) |> 
  correlate()
# A tibble: 3 × 4
  term            FantPt overall pre_draft_grade
  <chr>            <dbl>   <dbl>           <dbl>
1 FantPt          NA      -0.532           0.451
2 overall         -0.532  NA              -0.814
3 pre_draft_grade  0.451  -0.814          NA    

Reading one of these when you have a lot of numbers can be overwhelming, and it helps to take some notes as you go.

You read up and down and left and right – it’s a matrix. Follow the FantPt row across to the overall column and you’ll see they’re about 50 percent negatively correlated – -1 is a perfect negative correlation. Now look to the right to pre_draft_grade. They’re almost the same – but this time it’s about 45 percent positively correlated.

Now look at the overall column, and go down to pre_draft_grade. The correlation: -0.814. Remember that -1 is a perfect negative correlation – for every unit one goes up, the other goes down by one. That’s really close to -1.

What does that mean? It means including both is just going to add error without adding much value. They’re that similar. You pick the one that is more highly correlated with fantasy points: the overall pick.
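One more corrr convenience worth knowing: focus() trims the matrix down to just the correlations with your target, which helps once the matrix grows beyond three columns.

wrselected |> 
  select(FantPt, overall, pre_draft_grade) |> 
  correlate() |> 
  focus(FantPt)  # keeps only each variable's correlation with FantPt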