library(tidyverse)
library(tidymodels)
library(hoopR)
set.seed(1234)
4 Decision trees and random forests
4.1 The basics
Tree-based algorithms are based on decision trees, which are very easy to understand. A decision tree can basically be described as a series of questions. Does this player have more or less than x seasons of experience? Do they have more or less than y minutes played? Do they play this or that position? Answer enough questions, and you can predict what a player like that should average.
The upside of decision trees is that if the model is small, you can explain it to anyone; they’re very easy to understand. The trouble is that a small decision tree is a bit of a crude instrument. As such, multiple tree-based methods have been developed as improvements on the humble decision tree.
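To make the idea concrete, here is a minimal sketch of a single decision tree specification in tidymodels. It uses the rpart engine, and the formula and the players data frame are hypothetical stand-ins for whatever stats you have on hand, not something we build in this chapter.

# A minimal sketch of a single decision tree, assuming a hypothetical
# players data frame with experience, minutes and position columns.
tree_mod <- decision_tree(tree_depth = 3) |>
  set_engine("rpart") |>
  set_mode("regression")

# tree_fit <- tree_mod |> fit(points ~ experience + minutes_played + position, data = players)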
The most common is the random forest.
Let’s implement one. We start with libraries.
Let’s use what we had from the last tutorial – a rolling window of points per possession for team and opponent. I’ve gone ahead and run it all in the background. You can see modelgames by using head in the block.
games <- load_mbb_team_box(seasons = 2015:2024)

nond1 <- games |> group_by(team_id, season) |> tally() |> filter(n < 10) |> select(team_id)

nond1 <- pull(nond1)

df <- games |> filter(!team_id %in% nond1 & !opponent_team_id %in% nond1)
teamside <- df |>
  group_by(team_short_display_name, season) |>
  arrange(game_date) |>
  mutate(
    team_possessions = field_goals_attempted - offensive_rebounds + turnovers + (.475 * free_throws_attempted),
    team_points_per_possession = team_score/team_possessions,
    team_defensive_points_per_possession = opponent_team_score/team_possessions,
    team_offensive_efficiency = team_points_per_possession * 100,
    team_defensive_efficiency = team_defensive_points_per_possession * 100,
    team_season_offensive_efficiency = lag(cummean(team_offensive_efficiency), n=1),
    team_season_defensive_efficiency = lag(cummean(team_defensive_efficiency), n=1),
    score_margin = team_score - opponent_team_score,
    absolute_score_margin = abs(score_margin)
  ) |>
  filter(absolute_score_margin <= 40) |>
  ungroup()
opponentside <- teamside |>
  select(-opponent_team_id) |>
  rename(
    opponent_team_id = team_id,
    opponent_season_offensive_efficiency = team_season_offensive_efficiency,
    opponent_season_defensive_efficiency = team_season_defensive_efficiency
  ) |>
  select(
    game_id,
    opponent_team_id,
    opponent_season_offensive_efficiency,
    opponent_season_defensive_efficiency
  )
bothsides <- teamside |> inner_join(opponentside)

Joining with `by = join_by(game_id, opponent_team_id)`
bothsides <- bothsides |> mutate(
  team_result = as.factor(case_when(
    team_score > opponent_team_score ~ "W",
    opponent_team_score > team_score ~ "L"
  )))

bothsides$team_result <- relevel(bothsides$team_result, ref="W")
modelgames <- bothsides |>
  select(
    game_id,
    game_date,
    team_short_display_name,
    opponent_team_short_display_name,
    season,
    team_season_offensive_efficiency,
    team_season_defensive_efficiency,
    opponent_season_offensive_efficiency,
    opponent_season_defensive_efficiency,
    team_result
  ) |>
  na.omit()
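With all of that run, a quick look at the result:

head(modelgames)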
For this tutorial, we’re going to create two models from two workflows so that we can compare a logistic regression to a random forest.
4.2 Setup
A random forest is, as the name implies, a large number of decision trees, each built from a random set of inputs. The algorithm draws a large number of random samples from the training data and randomly chooses which features each branch can split on. The goal is a forest of trees that are uncorrelated with each other. The trees all make predictions, and the wisdom of the crowd takes over: in a classification algorithm, the most common prediction is the one that gets chosen; in a regression model, the predictions get averaged together.
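Here is a toy illustration of that idea. It is not part of our model, just five made-up tree predictions, showing the majority vote for classification and the average for regression.

# Five hypothetical trees vote on a class; the forest takes the most common answer.
tree_votes <- c("W", "W", "L", "W", "L")
names(which.max(table(tree_votes)))

# For a regression forest, the individual tree predictions get averaged instead.
mean(c(68, 72, 70, 69, 71))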
The random part of random forest is in how the tree splits are created and how the samples from the data are taken to generate those splits. Randomizing both limits the influence of any single feature and helps prevent overfitting – where your predictions are so tailored to your training data that they miss badly on the test data.
For random forests, we change the model type to rand_forest and set the engine to “ranger”. There are multiple implementations of the random forest algorithm, and the differences between them are beyond the scope of what we’re doing here.
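As a sketch of what that looks like (we stick with the defaults in this tutorial), rand_forest() lets you set how many trees to grow, how many predictors each split can choose from (mtry), and how small a node can get before it stops splitting (min_n). The numbers here are illustrative, not recommendations.

# Illustrative values only; we use the defaults later in this chapter.
rand_forest(trees = 1000, mtry = 2, min_n = 10) |>
  set_engine("ranger") |>
  set_mode("classification")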
We’re going to go through the steps of modeling again, starting with splitting our modelgames data.
4.2.1 Exercise 1: setting up your data
game_split <- initial_split(??????????, prop = .8)

game_train <- training(game_split)

game_test <- testing(game_split)
For this walkthrough, we’re going to do both a logistic regression and a random forest side by side to show the value of workflows.
The recipe we’ll create is the same for both, so we’ll use it twice.
4.2.2 Exercise 2: setting up the recipe
So what data are we feeding into our recipe?
game_recipe <-
  recipe(team_result ~ ., data = game_?????) |>
  update_role(game_id, game_date, team_short_display_name, opponent_team_short_display_name, season, new_role = "ID") |>
  step_normalize(all_predictors())
summary(game_recipe)
# A tibble: 10 × 4
variable type role source
<chr> <list> <chr> <chr>
1 game_id <chr [2]> ID original
2 game_date <chr [1]> ID original
3 team_short_display_name <chr [3]> ID original
4 opponent_team_short_display_name <chr [3]> ID original
5 season <chr [2]> ID original
6 team_season_offensive_efficiency <chr [2]> predictor original
7 team_season_defensive_efficiency <chr [2]> predictor original
8 opponent_season_offensive_efficiency <chr [2]> predictor original
9 opponent_season_defensive_efficiency <chr [2]> predictor original
10 team_result <chr [3]> outcome original
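If you’re curious what step_normalize is actually doing to the predictors, one way to peek, once you’ve filled in the blank above, is to prep the recipe and bake it, which returns the centered and scaled data.

# Assumes the blank in game_recipe has been filled in.
game_recipe |>
  prep() |>
  bake(new_data = NULL) |>
  head()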
Now, we’re going to create two different model specifications. The first will be the logistic regression model definition and the second will be the random forest.
log_mod <-
  logistic_reg() |>
  set_engine("glm") |>
  set_mode("classification")

rf_mod <-
  rand_forest() |>
  set_engine("ranger") |>
  set_mode("classification")
Now we have enough for our workflows. We have two models and one recipe.
4.2.3 Exercise 3: making workflows
log_workflow <-
  workflow() |>
  add_model(???_mod) |>
  add_recipe(????_recipe)

rf_workflow <-
  workflow() |>
  add_model(??_mod) |>
  add_recipe(????_recipe)
Now we can fit our models to the data.
4.2.4 Exercise 4: fitting our models
log_fit <-
  log_workflow |>
  fit(data = ????_?????)

rf_fit <-
  rf_workflow |>
  fit(data = ????_?????)
Now we can bind our predictions to the training data and see how we did.
logpredict <- log_fit |> predict(new_data = game_train) |>
  bind_cols(game_train)

logpredict <- log_fit |> predict(new_data = game_train, type="prob") |>
  bind_cols(logpredict)

rfpredict <- rf_fit |> predict(new_data = game_train) |>
  bind_cols(game_train)

rfpredict <- rf_fit |> predict(new_data = game_train, type="prob") |>
  bind_cols(rfpredict)
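If you want to see what bind_cols gave you, something like this shows the predicted class, the class probabilities and the actual result side by side. The probability column names assume the W and L factor levels we set earlier.

rfpredict |>
  select(.pred_class, .pred_W, .pred_L, team_result) |>
  head()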
Now, how did we do? First, let’s look at the logistic regression.
4.2.5 Exercise 5: The first metrics
What prediction dataset do we feed into our metrics?
metrics(??????????, team_result, .pred_class)
# A tibble: 2 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.676
2 kap binary 0.353
Same as last time, the logistic regression model comes in at about 68 percent accuracy, and, as you’ll see in a moment, it stays pretty stable when we expose it to testing data. That’s a gigantic hint about what is to come.
How about the random forest?
4.2.6 Exercise 6: Random forest metrics
metrics(??????????, team_result, .pred_class)
# A tibble: 2 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.987
2 kap binary 0.974
Holy buckets! We made a model that’s 99 percent accurate? GET ME TO VEGAS.
Remember: Where a model makes its money is in data that it has never seen before.
First, we look at logistic regression.
logtestpredict <- log_fit |> predict(new_data = game_test) |>
  bind_cols(game_test)

logtestpredict <- log_fit |> predict(new_data = game_test, type="prob") |>
  bind_cols(logtestpredict)
metrics(logtestpredict, team_result, .pred_class)
# A tibble: 2 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.684
2 kap binary 0.368
Just about the same. That’s a robust model.
Now, the inevitable crash with random forests.
rftestpredict <- rf_fit |> predict(new_data = game_test) |>
  bind_cols(game_test)

rftestpredict <- rf_fit |> predict(new_data = game_test, type="prob") |>
  bind_cols(rftestpredict)
metrics(rftestpredict, team_result, .pred_class)
# A tibble: 2 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.664
2 kap binary 0.328
Right at 66 percent. A little lower than the logistic regression. But did the two models come to the same answers to get those numbers? No.
logtestpredict |>
  conf_mat(team_result, .pred_class)

          Truth
Prediction    W    L
         W 6701 3052
         L 3079 6576
rftestpredict |>
  conf_mat(team_result, .pred_class)

          Truth
Prediction    W    L
         W 6455 3200
         L 3325 6428
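Both prediction data frames were built from the same game_test rows in the same order, so one rough way to see how often the two models disagree on the same game is to compare their class calls directly.

# Share of test games where the two models picked different outcomes.
mean(logtestpredict$.pred_class != rftestpredict$.pred_class)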
Our two models, based on our very basic feature engineering, are only slightly better than flipping a coin. If we want to get better, we’ve got work to do.