4  Decision trees and random forests

4.1 The basics

Tree-based algorithms are built on decision trees, which are very easy to understand. A decision tree can basically be described as a series of questions. Does this player have more or less than x seasons of experience? Do they have more or less than y minutes played? Do they play this or that position? Answer enough questions, and you can predict what a player like that should produce, on average.
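
To make that concrete, here is a toy decision path written out by hand. The questions, thresholds, and labels are made up purely for illustration; a real decision tree learns them from data.

# A decision tree is a series of yes/no questions that ends in a prediction.
# These thresholds and labels are hypothetical, purely for illustration.
predict_player <- function(seasons, minutes_played) {
  if (seasons >= 3) {
    if (minutes_played >= 1500) "heavy contributor" else "veteran role player"
  } else {
    if (minutes_played >= 800) "young starter" else "bench player"
  }
}

predict_player(seasons = 4, minutes_played = 2100)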

The upside of decision trees is that if the model is small, you can explain it to anyone; they’re very easy to understand. The trouble is that a model that small is also a bit of a crude instrument. As such, multiple tree-based methods have been developed as improvements on the humble decision tree.

The most common is the random forest.

Let’s implement one. We start with libraries.

library(tidyverse)
library(tidymodels)
library(hoopR)

set.seed(1234)

Let’s use what we had from the last tutorial – a rolling window of points per possession for team and opponent. The code that rebuilds it runs below; you can see modelgames by using head in the block after it.

games <- load_mbb_team_box(seasons = 2015:2024)

# find teams with fewer than 10 games in a season -- the non-Division I teams we want to drop
nond1 <- games |> group_by(team_id, season) |> tally() |> filter(n < 10) |> select(team_id)
nond1 <- pull(nond1)

df <- games |> filter(!team_id %in% nond1 & !opponent_team_id %in% nond1)

# one row per team-game: possessions, efficiencies, and the lagged
# season-to-date averages we'll use as predictors
teamside <- df |> 
  group_by(team_short_display_name, season) |> 
  arrange(game_date) |> 
  mutate(
    team_possessions = field_goals_attempted - offensive_rebounds + turnovers + (.475 * free_throws_attempted),
    team_points_per_possession = team_score/team_possessions,
    team_defensive_points_per_possession = opponent_team_score/team_possessions,
    team_offensive_efficiency = team_points_per_possession * 100,
    team_defensive_efficiency = team_defensive_points_per_possession * 100,
    team_season_offensive_efficiency = lag(cummean(team_offensive_efficiency), n=1),
    team_season_defensive_efficiency = lag(cummean(team_defensive_efficiency), n=1),  
    score_margin = team_score - opponent_team_score,
    absolute_score_margin = abs(score_margin)
  ) |> 
  filter(absolute_score_margin <= 40) |> 
  ungroup()

# flip the team columns so each team's lagged efficiencies can be joined
# back on as its opponent's numbers
opponentside <- teamside |> 
  select(-opponent_team_id) |> 
  rename(
    opponent_team_id = team_id,
    opponent_season_offensive_efficiency = team_season_offensive_efficiency,
    opponent_season_defensive_efficiency = team_season_defensive_efficiency
  ) |> 
  select(
    game_id,
    opponent_team_id,
    opponent_season_offensive_efficiency,
    opponent_season_defensive_efficiency
  )

bothsides <- teamside |> 
  inner_join(opponentside, by = join_by(game_id, opponent_team_id))

bothsides <- bothsides |> mutate(
  team_result = as.factor(case_when(
    team_score > opponent_team_score ~ "W",
    opponent_team_score > team_score ~ "L"
)))

bothsides$team_result <- relevel(bothsides$team_result, ref="W")

modelgames <- bothsides |> 
  select(
    game_id, 
    game_date, 
    team_short_display_name, 
    opponent_team_short_display_name, 
    season, 
    team_season_offensive_efficiency,
    team_season_defensive_efficiency,
    opponent_season_offensive_efficiency,
    opponent_season_defensive_efficiency,
    team_result
    ) |> 
  na.omit()
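
As promised, you can take a look at what we just built:

head(modelgames)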

For this tutorial, we’re going to create two models from two workflows so that we can compare a logistic regression to a random forest.

4.2 Setup

A random forest is, as the name implies, a large number of decision trees, each built from a random slice of the data. The algorithm draws many random samples of the training data and, at each branch, considers only a random subset of the features. The goal is a forest of trees that are largely uncorrelated with one another. Every tree makes a prediction, and the wisdom of the crowd takes over: in a classification model, the most common prediction is the one that gets chosen; in a regression model, the predictions get averaged together.
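
Here is that aggregation step in miniature, with made-up predictions from a handful of imaginary trees:

# Classification: each tree casts a vote, and the majority wins.
tree_votes <- c("W", "W", "L", "W", "L")
names(which.max(table(tree_votes)))

# Regression: each tree produces a number, and the numbers get averaged.
tree_estimates <- c(71.2, 68.9, 70.4, 69.7)
mean(tree_estimates)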

The random part of random forest is in how the tree splits get created and how the samples from the data are taken to generate them. Both are randomized, which limits the influence of any one feature and helps prevent overfitting, where your predictions are so tailored to your training data that they miss badly on the test data.
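
The sampling half of that randomness looks like this in miniature. Each tree trains on a bootstrap sample of the rows, drawn with replacement, so some rows show up more than once and others not at all:

# Ten imaginary row numbers; every tree in the forest would get a different draw like this.
rows <- 1:10
sample(rows, size = length(rows), replace = TRUE)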

For random forests, we change the model type to rand_forest and set the engine to “ranger”. There are multiple implementations of the random forest algorithm, and the differences between them are beyond the scope of what we’re doing here.

We’re going to go through the steps of modeling again, starting with splitting our modelgames data.

4.2.1 Exercise 1: setting up your data

game_split <- initial_split(??????????, prop = .8)
game_train <- training(game_split)
game_test <- testing(game_split)
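
Once you’ve filled that in, a quick sanity check never hurts: the training set should hold roughly 80 percent of the rows in modelgames.

# Should come out near 0.8
nrow(game_train) / nrow(modelgames)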

For this walkthrough, we’re going to do both a logistic regression and a random forest side by side to show the value of workflows.

The recipe we’ll create is the same for both, so we’ll use it twice.

4.2.2 Exercise 2: setting up the recipe

So what data are we feeding into our recipe?

game_recipe <- 
  recipe(team_result ~ ., data = game_?????) |> 
  update_role(game_id, game_date, team_short_display_name, opponent_team_short_display_name, season, new_role = "ID") |>
  step_normalize(all_predictors())

summary(game_recipe)
# A tibble: 10 × 4
   variable                             type      role      source  
   <chr>                                <list>    <chr>     <chr>   
 1 game_id                              <chr [2]> ID        original
 2 game_date                            <chr [1]> ID        original
 3 team_short_display_name              <chr [3]> ID        original
 4 opponent_team_short_display_name     <chr [3]> ID        original
 5 season                               <chr [2]> ID        original
 6 team_season_offensive_efficiency     <chr [2]> predictor original
 7 team_season_defensive_efficiency     <chr [2]> predictor original
 8 opponent_season_offensive_efficiency <chr [2]> predictor original
 9 opponent_season_defensive_efficiency <chr [2]> predictor original
10 team_result                          <chr [3]> outcome   original

Now, we’re going to create two different model specifications. The first will be the logistic regression model definition and the second will be the random forest.

log_mod <- 
  logistic_reg() |> 
  set_engine("glm") |>
  set_mode("classification")

rf_mod <- 
  rand_forest() |> 
  set_engine("ranger") |>
  set_mode("classification")

Now we have enough for our workflows. We have two models and one recipe.

4.2.3 Exercise 3: making workflows

log_workflow <- 
  workflow() |> 
  add_model(???_mod) |> 
  add_recipe(????_recipe)

rf_workflow <- 
  workflow() |> 
  add_model(??_mod) |> 
  add_recipe(????_recipe)

Now we can fit our models to the data.

4.2.4 Exercise 4: fitting our models

log_fit <- 
  log_workflow |> 
  fit(data = ????_?????)

rf_fit <- 
  rf_workflow |> 
  fit(data = ????_?????)

Now we can bind our predictions to the training data and see how we did.

logpredict <- log_fit |> predict(new_data = game_train) |>
  bind_cols(game_train) 

logpredict <- log_fit |> predict(new_data = game_train, type="prob") |>
  bind_cols(logpredict)

rfpredict <- rf_fit |> predict(new_data = game_train) |>
  bind_cols(game_train) 

rfpredict <- rf_fit |> predict(new_data = game_train, type="prob") |>
  bind_cols(rfpredict)
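
If the predict-then-bind_cols two-step feels repetitive, recent versions of tidymodels offer augment as a shortcut that should produce the same columns in one call. We’ll keep the explicit version above, but for reference:

# A shortcut, assuming a reasonably recent tidymodels install: augment() predicts
# the class and the class probabilities and binds them onto the data.
logpredict_alt <- augment(log_fit, new_data = game_train)
rfpredict_alt <- augment(rf_fit, new_data = game_train)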

Now, how did we do? First, let’s look at the logistic regression.

4.2.5 Exercise 5: The first metrics

What prediction dataset do we feed into our metrics?

metrics(??????????, team_result, .pred_class)
# A tibble: 2 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.676
2 kap      binary         0.353

Same as last time, the logistic regression model comes in at about 68 percent accuracy, and, as we’ll see in a moment, it stays pretty stable when we expose it to the testing data. This is a gigantic hint about what is to come.

How about the random forest?

4.2.6 Exercise 6: Random forest metrics

metrics(??????????, team_result, .pred_class)
# A tibble: 2 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.987
2 kap      binary         0.974

Holy buckets! We made a model that’s 99 percent accurate? GET ME TO VEGAS.

Remember: Where a model makes its money is in data that it has never seen before.

First, we look at logistic regression.

logtestpredict <- log_fit |> predict(new_data = game_test) |>
  bind_cols(game_test)

logtestpredict <- log_fit |> predict(new_data = game_test, type="prob") |>
  bind_cols(logtestpredict)

metrics(logtestpredict, team_result, .pred_class)
# A tibble: 2 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.684
2 kap      binary         0.368

Just about the same. That’s a robust model.

Now, the inevitable crash with random forests.

rftestpredict <- rf_fit |> predict(new_data = game_test) |>
  bind_cols(game_test)

rftestpredict <- rf_fit |> predict(new_data = game_test, type="prob") |>
  bind_cols(rftestpredict)

metrics(rftestpredict, team_result, .pred_class)
# A tibble: 2 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.664
2 kap      binary         0.328

Right at 66 percent. A little bit lower than logistic regression. But did they come to the same answers to get those numbers? No.

logtestpredict |>
  conf_mat(team_result, .pred_class)
          Truth
Prediction    W    L
         W 6701 3052
         L 3079 6576

rftestpredict |>
  conf_mat(team_result, .pred_class)
          Truth
Prediction    W    L
         W 6455 3200
         L 3325 6428
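
The confusion matrices give us the totals, but we can also ask how often the two models disagreed on the same game. A quick sketch (both prediction data frames were built by binding onto game_test in the same order, so the rows line up):

# Count how many games the two models agree and disagree on.
tibble(
  logistic = logtestpredict$.pred_class,
  forest = rftestpredict$.pred_class
) |> 
  count(agree = logistic == forest)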

Our two models, based on our very basic feature engineering, are only slightly better than flipping a coin. If we want to get better, we’ve got work to do.