10  Random forests to predict a number

10.1 The basics

And now we return to decision trees and random forests. Recall that a single decision tree is very easy to understand. A random forest is, as the name implies, a large number of decision trees: the algorithm trains each tree on a random sample of the training data, and at each fork in each tree it considers only a random subset of the features. The goal is a forest of uncorrelated trees. Every tree makes a prediction, the predictions are combined, and the wisdom of the crowd takes over.

This time, we’re going to clean up our code a bit and make it look more like our previous work, where we made recipes and workflows.

As always, we start with libraries.

library(tidyverse)
library(tidymodels)

set.seed(1234)

And we’ll need some data. We’ll use our draft data from cfbfastR and fantasy data from Pro-Football Reference.

fantasy <- read_csv("https://mattwaite.github.io/sportsdatafiles/fantasyfootball20132022.csv")

wr <- read_csv("https://mattwaite.github.io/sportsdatafiles/wr20132022.csv")

wrdrafted <- wr |> 
  inner_join(fantasy, by=c("name"="Player", "year"="Season"))

And again we have a dataframe of 263 observations – drafted wide receivers with their fantasy stats attached to them.
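If you want to sanity-check the join before moving on, a quick look with nrow and glimpse (both come along with the tidyverse) works. This is just an optional check, not part of the pipeline:

# quick sanity check on the join: row count and column overview
nrow(wrdrafted)
glimpse(wrdrafted)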

Let’s narrow that down to just the columns we need:

wrselected <- wrdrafted |>
  select(
    name,
    year,
    college_team,
    nfl_team,
    overall,
    pre_draft_grade,
    FantPt
  ) |> na.omit()

Before we get to the recipe, let’s split our data.

player_split <- initial_split(wrselected, prop = .8)

player_train <- training(player_split)
player_test <- testing(player_split)
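With prop = .8, roughly 80 percent of the rows land in training and the rest in testing. If you want to verify that, a quick sketch:

# confirm the split: about 80 percent of rows should be in training
nrow(player_train)
nrow(player_test)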

Now we’re ready to start.

10.1.1 Exercise 1: What data are we feeding the recipe?

player_recipe <- 
  recipe(FantPt ~ ., data = player_?????) |> 
  update_role(name, year, college_team, nfl_team, new_role = "ID")

summary(player_recipe)
# A tibble: 7 × 4
  variable        type      role      source  
  <chr>           <list>    <chr>     <chr>   
1 name            <chr [3]> ID        original
2 year            <chr [2]> ID        original
3 college_team    <chr [3]> ID        original
4 nfl_team        <chr [3]> ID        original
5 overall         <chr [2]> predictor original
6 pre_draft_grade <chr [2]> predictor original
7 FantPt          <chr [2]> outcome   original
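One thing worth noting: update_role doesn’t drop the ID columns, it just tells the model to ignore them. Once you’ve filled in the blank above, you can see exactly what the recipe hands to the model with prep and bake, both part of recipes. A sketch:

# prep() estimates the recipe on the training data; bake() with
# new_data = NULL returns that processed training data, ID columns included
player_recipe |>
  prep() |>
  bake(new_data = NULL)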

Now, we’re going to create two different model specifications. The first will be the linear regression model definition and the second will be the random forest.

linear_mod <- 
  linear_reg() |> 
  set_engine("lm") |>
  set_mode("regression")

rf_mod <- 
  rand_forest() |>
  set_engine("ranger") |>
  set_mode("regression")
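We’re accepting rand_forest’s defaults here, but the randomness described at the top of this chapter lives in its arguments. As a point of reference, here’s a sketch with those knobs spelled out; the values are illustrations, not tuned choices:

rand_forest(
  mtry = 1,     # predictors randomly considered at each split; with only
                # two predictors here, 1 is the only value that adds randomness
  trees = 1000, # number of trees in the forest
  min_n = 5     # smallest node size that can still be split
) |>
  set_engine("ranger") |>
  set_mode("regression")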

Now we have enough for our workflows. We have two models and one recipe.

10.1.2 Exercise 2: making workflows

linear_workflow <- 
  workflow() |> 
  add_model(??????_mod) |> 
  add_recipe(???????_recipe)

rf_workflow <- 
  workflow() |> 
  add_model(??_mod) |> 
  add_recipe(??????_recipe)

Now we can fit our models to the data.

10.1.3 Exercise 3: fitting our models

linear_fit <- 
  linear_workflow |> 
  fit(data = ????_?????)

rf_fit <- 
  rf_workflow |> 
  fit(data = ????_?????)
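Once the fits finish, you can inspect what’s inside either one with extract_fit_parsnip(), which comes along with workflows. For the random forest, ranger prints a summary that includes its out-of-bag prediction error, a built-in estimate of performance on unseen data. A sketch:

# pull the fitted parsnip model out of the workflow; for ranger, the
# printout includes the out-of-bag (OOB) prediction error
rf_fit |>
  extract_fit_parsnip()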

Now we can bind our predictions to the training data and see how we did.

linearpredict <- 
  linear_fit |> 
  predict(new_data = player_train) |>
  bind_cols(player_train) 

rfpredict <- 
  rf_fit |> 
  predict(new_data = player_train) |>
  bind_cols(player_train) 

Now, how did we do? First, let’s look at the linear regression.

10.1.4 Exercise 4: The first metrics

What prediction dataset do we feed into our metrics?

metrics(?????????????, FantPt, .pred)
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      40.9  
2 rsq     standard       0.316
3 mae     standard      31.5  

Same as last time. An r squared in the low .30s, an rmse in the 40s. Nothing to write home about. How did the random forest do?

10.1.5 Exercise 5: Random forest metrics

metrics(??predict, FantPt, .pred)
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      22.6  
2 rsq     standard       0.812
3 mae     standard      17.2  

Hopefully you learned your lesson the last time we did random forests – they have a habit of getting your hopes up on training only to dash them on testing, especially when you have two highly correlated predictors.

Is that what happened here?

rftestpredict <- 
  rf_fit |> 
  predict(new_data = player_test) |>
  bind_cols(player_test) 

metrics(rftestpredict, FantPt, .pred)
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      49.3  
2 rsq     standard       0.157
3 mae     standard      38.2  

Indeed.
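For a fair comparison, run the linear model through the same test-set check. A sketch, with lineartestpredict as a name of our choosing; expect its numbers to stay close to its training metrics, since a two-predictor linear regression has far less room to overfit:

# score the linear model on the held-out test set, same as the forest
lineartestpredict <- 
  linear_fit |> 
  predict(new_data = player_test) |>
  bind_cols(player_test)

metrics(lineartestpredict, FantPt, .pred)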

Safe to say we’ve reached the limits of overall draft pick and pre-draft grades for predictive value. Time to add more.