5 XGBoost
5.1 The basics
As we learned in the previous chapter, random forests (and bagged methods) average together a large number of trees to get to an answer. Random forests add a wrinkle by randomly choosing features at each split, so the trees are decorrelated, and the trees are grown rather deep. The idea behind averaging them together is to cut down on the variance in predictions – deep trees have high variance, which makes any single one of them fit unseen data poorly. Random forests are fairly simple to implement and are very popular.
Boosting methods are another wrinkle in tree-based methods. Instead of deep trees, boosting methods intentionally pick shallow trees – called stumps – that, at least initially, do a poor job of predicting the outcome. Then each subsequent stump takes the result of the previous one, optimizes to reduce the residuals – the gap between prediction and reality – and makes a new prediction. Then the next one does the same, and so on and so on.
The path to a boosted model is complex and the results can take a lot of your computer’s time, but the models are more generalizable, meaning they handle new data better than other methods. Among data scientists, boosted methods such as xgboost and lightgbm are very popular for solving a wide variety of problems.
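To make the sequential idea concrete, here is a minimal hand-rolled sketch of boosting on made-up data. This is not how xgboost works internally – it assumes you have the rpart package for the stumps, and the data is fake – but it shows each round fitting a stump to the previous round’s residuals and nudging the running prediction along.

library(rpart)

# fake data: a noisy sine wave
toy <- data.frame(x = runif(500, 0, 10))
toy$y <- sin(toy$x) + rnorm(500, sd = .3)

learn_rate <- .1
prediction <- rep(mean(toy$y), nrow(toy)) # round 0: just the mean

for (i in 1:30) {
  # fit a stump -- a one-split tree -- to what's still unexplained
  toy$residual <- toy$y - prediction
  stump <- rpart(residual ~ x, data = toy, maxdepth = 1)
  # nudge the prediction toward reality, scaled by the learn rate
  prediction <- prediction + learn_rate * predict(stump, toy)
}

# the gap between prediction and reality shrinks with each round
sqrt(mean((toy$y - prediction)^2))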
Let’s re-implement our predictions in an XGBoost algorithm. First, we’ll load libraries and set a seed so the random parts of what follows are reproducible.

library(tidyverse)
library(tidymodels)
library(zoo)
library(hoopR)

set.seed(1234)
We’ll load our game data and do the same spot of feature engineering we used with our other models.
teamgames <- load_mbb_team_box(seasons = 2015:2024)

teamstats <- teamgames |>
  mutate(
    # estimate possessions; the .475 weights free throws that end a possession
    possessions = field_goals_attempted - offensive_rebounds + turnovers + (.475 * free_throws_attempted),
    ppp = team_score/possessions, # points per possession
    oppp = opponent_team_score/possessions # opponent points per possession
  )
rollingteamstats <- teamstats |>
  group_by(team_short_display_name, season) |>
  arrange(game_date) |>
  mutate(
    # lag by one game so each rolling average only uses prior games
    team_rolling_ppp = rollmean(lag(ppp, n=1), k=5, align="right", fill=NA),
    team_rolling_oppp = rollmean(lag(oppp, n=1), k=5, align="right", fill=NA)
  ) |>
  ungroup()
team_side <- rollingteamstats |>
  select(
    game_id,
    team_id,
    team_short_display_name,
    opponent_team_id,
    game_date,
    season,
    team_score,
    team_rolling_ppp,
    team_rolling_oppp
  )
opponent_side <- team_side |>
  select(-opponent_team_id) |>
  rename(
    opponent_team_id = team_id,
    opponent_short_display_name = team_short_display_name,
    opponent_score = team_score,
    opponent_rolling_ppp = team_rolling_ppp,
    opponent_rolling_oppp = team_rolling_oppp
  ) |>
  mutate(opponent_id = as.numeric(opponent_team_id))
games <- team_side |> inner_join(opponent_side)

games <- games |> mutate(
  team_result = as.factor(case_when(
    team_score > opponent_score ~ "W",
    opponent_score > team_score ~ "L"
  )))

games$team_result <- relevel(games$team_result, ref="W")
modelgames <- games |>
  select(
    game_id,
    game_date,
    team_short_display_name,
    opponent_short_display_name,
    season,
    team_rolling_ppp,
    team_rolling_oppp,
    opponent_rolling_ppp,
    opponent_rolling_oppp,
    team_result
  ) |>
  na.omit()
Per usual, we split our data into training and testing.
game_split <- initial_split(modelgames, prop = .8)
game_train <- training(game_split)
game_test <- testing(game_split)
And our recipe.
game_recipe <-
  recipe(team_result ~ ., data = game_train) |>
  update_role(game_id, game_date, team_short_display_name, opponent_short_display_name, season, new_role = "ID")
summary(game_recipe)
# A tibble: 10 × 4
variable type role source
<chr> <list> <chr> <chr>
1 game_id <chr [2]> ID original
2 game_date <chr [1]> ID original
3 team_short_display_name <chr [3]> ID original
4 opponent_short_display_name <chr [3]> ID original
5 season <chr [2]> ID original
6 team_rolling_ppp <chr [2]> predictor original
7 team_rolling_oppp <chr [2]> predictor original
8 opponent_rolling_ppp <chr [2]> predictor original
9 opponent_rolling_oppp <chr [2]> predictor original
10 team_result <chr [3]> outcome original
To this point, everything looks like what we’ve done before. Nothing has really changed. It’s about to.
5.2 Hyperparameters
Hyperparameters are the inputs into the algorithm that shape the fit. To find the ideal hyperparameters, you need to tune them. But first, let’s talk about the hyperparameters we are going to tune (there are others we could tune, but every addition to the number of hyperparameters means more computation):
- mtry – the number of predictors that will be randomly sampled at each split when making trees.
- Learn rate – this controls how fast the algorithm goes down the gradient descent – how fast it learns (see the sketch after this list). Too fast and you’ll overshoot the optimal stopping point and start going back up the error curve. Too slow and you’ll never get to the optimal stopping point.
- Tree depth – controls the depth of each individual tree. Too short and you’ll need a lot of them to get good results. Too deep and you risk overfitting.
- Minimum number of observations in the terminal node (min_n) – controls the complexity of each tree. Typical values range from 5-15, and higher values keep a model from figuring out relationships that are unique to that training set (i.e., overfitting).
- Loss reduction – this is the minimum loss reduction required to make a new split. If the improvement meets this minimum, a split occurs. A low value gets you a complex tree; a high value gets you a more conservative tree that’s more robust to new data.
Others you can tune, but which have sensible defaults:
- Number of trees – this is the total number of trees in the sequence. A gradient boosting algorithm will minimize residuals forever, so you need to tell it where to stop. That stopping point is different for every problem. You can tune this, but I’ll warn you – this is the most computationally expensive tuning. For our example, we’re going to set the number of trees at 30. The default is 15. Tuning it can take a good computer more than an hour to complete, and you’ll have gained .1 percent of accuracy. If we’re Amazon and billions of dollars are on the line, it’s worth it. For predicting NCAA tournament games, it is not.
- Sample size – The fraction of the total training set that can be used for each boosting round. Low values may lead to underfitting, high to overfitting.
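Here’s the learn rate sketch promised above: a tiny, self-contained example of gradient descent on a made-up function, purely for intuition. We walk down the gradient of f(x) = (x - 3)^2, which bottoms out at x = 3.

# hypothetical 1-D gradient descent: minimize f(x) = (x - 3)^2,
# whose gradient is 2 * (x - 3)
descend <- function(learn_rate, steps = 25) {
  x <- 0
  for (i in 1:steps) x <- x - learn_rate * 2 * (x - 3)
  x
}

descend(.1)  # creeps steadily toward 3
descend(.9)  # overshoots and oscillates around 3, but settles
descend(1.1) # too fast: each step overshoots worse and it diverges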
All of these combine to make the model, and each has its own ideal value. How do we find it? Tuning.
5.2.1 Exercise 1: tuning
First, we make a model and label each parameter we want tuned as tune().
xg_mod <- boost_tree(
  trees = 30,
  mtry = tune(),
  learn_rate = tune(),
  tree_depth = tune(),
  min_n = tune(),
  loss_reduction = tune()
) |>
  set_mode("classification") |>
  set_engine("xgboost")
5.2.2 Exercise 2: making a workflow
Let’s make a workflow now that we have our recipe and our model.
game_wflow <-
  workflow() |>
  add_model(xg_mod) |>
  add_recipe(game_recipe)
Now, to tune the model, we have to create a grid. The grid is essentially a random sample of parameter values to try. The Latin hypercube is a method of creating a near-random sample of parameter values in multidimensional distributions (i.e., there’s more than one parameter). The Latin hypercube is only near-random because there has to be one sample in each row and column of the hypercube. Essentially, it removes the possibility of totally empty spaces in the cube. Why is that important? Because this hypercube is how your tuning is going to find the optimal outputs.
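If the hypercube is hard to picture, here’s a from-scratch sketch in two dimensions using only base R (the data is made up): divide each axis into five bins, then draw exactly one point per bin on each axis, so no row or column of the square goes empty.

n <- 5
x_bin <- sample(1:n) # random permutations pair up the bins
y_bin <- sample(1:n)

# jitter each point inside its bin, scaled back to [0, 1]
lhs_points <- data.frame(
  x = (x_bin - runif(n)) / n,
  y = (y_bin - runif(n)) / n
)

lhs_points # every fifth of each axis holds exactly one point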
What follows is the set of parameters the hypercube will sample.
xgb_grid <- grid_latin_hypercube(
  finalize(mtry(), game_train),
  tree_depth(),
  min_n(),
  loss_reduction(),
  learn_rate(),
  size = 30
)

xgb_grid
# A tibble: 30 × 5
mtry tree_depth min_n loss_reduction learn_rate
<int> <int> <int> <dbl> <dbl>
1 3 6 10 4.14e-3 1.84e- 4
2 2 9 16 1.27e-4 6.39e- 3
3 10 2 23 1.02e-5 9.98e-10
4 5 12 20 4.56e-7 5.62e-10
5 9 8 33 1.79e+1 6.33e- 2
6 6 8 15 1.19e-9 2.62e- 3
7 5 5 7 1.44e-1 7.97e- 5
8 7 12 28 3.41e-8 4.06e- 5
9 1 13 25 1.04e+1 1.29e-10
10 4 5 6 1.37e+0 6.46e- 7
# ℹ 20 more rows
How do we tune it? Using something called cross-fold validation. Cross-fold validation takes our grid, applies it to a set of subsets of the training data (in our case 10 subsets) and compares the results. It’ll take a random square in the hypercube, try the combinations in there, and see what happens. It’ll then keep doing that, over and over and over. When it’s done, each validation set will have a set of tuned values and outcomes that we can evaluate, and we can pick the optimal set to get a result.
5.2.3 Exercise 3: creating our cross-fold validation set
This will create the folds, which are just 10 random subsets of the training data.
game_folds <- vfold_cv(game_train)

game_folds
# 10-fold cross-validation
# A tibble: 10 × 2
splits id
<list> <chr>
1 <split [60703/6745]> Fold01
2 <split [60703/6745]> Fold02
3 <split [60703/6745]> Fold03
4 <split [60703/6745]> Fold04
5 <split [60703/6745]> Fold05
6 <split [60703/6745]> Fold06
7 <split [60703/6745]> Fold07
8 <split [60703/6745]> Fold08
9 <split [60704/6744]> Fold09
10 <split [60704/6744]> Fold10
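If you want to see what one of those folds actually holds, rsample ships two helpers: analysis() pulls the piece the model trains on for that fold, and assessment() pulls the held-out piece it gets scored against. A quick peek at the first fold:

one_fold <- game_folds$splits[[1]]

nrow(analysis(one_fold))   # rows the model fits on in this fold
nrow(assessment(one_fold)) # held-out rows it predicts for this fold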
Now we come to the part that is going to take some time on your computer. How long? It depends. On my old 2018 Intel Mac, this part took about 25-30 minutes and made it sound like the machine was attempting liftoff. My 2022 M1 Mac? Right around 10 minutes. Depending on how new and how powerful your computer is, it could take minutes, or it could take hours. The point being: start it up, walk away, and let it burn.
xgb_res <- tune_grid(
  game_wflow,
  resamples = game_folds,
  grid = xgb_grid,
  control = control_grid(save_pred = TRUE)
)
Our grid has run on all of our validation samples, and what do we see?
collect_metrics(xgb_res)
# A tibble: 60 × 11
mtry min_n tree_depth learn_rate loss_reduction .metric .estimator mean
<int> <int> <int> <dbl> <dbl> <chr> <chr> <dbl>
1 3 10 6 1.84e- 4 0.00414 accuracy binary 0.628
2 3 10 6 1.84e- 4 0.00414 roc_auc binary 0.680
3 2 16 9 6.39e- 3 0.000127 accuracy binary 0.633
4 2 16 9 6.39e- 3 0.000127 roc_auc binary 0.687
5 10 23 2 9.98e-10 0.0000102 accuracy binary 0.500
6 10 23 2 9.98e-10 0.0000102 roc_auc binary 0.5
7 5 20 12 5.62e-10 0.000000456 accuracy binary 0.500
8 5 20 12 5.62e-10 0.000000456 roc_auc binary 0.5
9 9 33 8 6.33e- 2 17.9 accuracy binary 0.636
10 9 33 8 6.33e- 2 17.9 roc_auc binary 0.689
# ℹ 50 more rows
# ℹ 3 more variables: n <int>, std_err <dbl>, .config <chr>
Well, we see our 30 combinations with two metrics each – 60 rows in all. But that doesn’t mean much to us just eyeballing it. We want to see the best combination. There’s a function to show us exactly that, called … wait for it … show_best.

show_best(xgb_res, "accuracy")
# A tibble: 5 × 11
mtry min_n tree_depth learn_rate loss_reduction .metric .estimator mean
<int> <int> <int> <dbl> <dbl> <chr> <chr> <dbl>
1 9 33 8 0.0633 17.9 accuracy binary 0.636
2 1 34 11 0.0170 0.0000000846 accuracy binary 0.635
3 2 16 9 0.00639 0.000127 accuracy binary 0.633
4 2 30 11 0.0276 0.00000530 accuracy binary 0.633
5 2 20 7 0.0000202 0.0372 accuracy binary 0.632
# ℹ 3 more variables: n <int>, std_err <dbl>, .config <chr>
The best combination as of this data update comes up with an accuracy of about 63.6 percent.
5.3 Finalizing our model
Let’s capture our best set of hyperparameters so we can use them in our model.
best_acc <- select_best(xgb_res, "accuracy")
And now we put that into a final workflow. Pay attention to the main arguments in the output below.
final_xgb <- finalize_workflow(
  game_wflow,
  best_acc
)

final_xgb
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: boost_tree()
── Preprocessor ────────────────────────────────────────────────────────────────
0 Recipe Steps
── Model ───────────────────────────────────────────────────────────────────────
Boosted Tree Model Specification (classification)
Main Arguments:
mtry = 9
trees = 30
min_n = 33
tree_depth = 8
learn_rate = 0.0633417088937915
loss_reduction = 17.9327965928095
Computational engine: xgboost
There’s our best set of hyperparameters. We’ve tuned this model to give the best possible set of results in those settings. Now we apply it like we have been doing all along.
5.3.1 Exercise 4: making our final workflow
We create a fit using our finalized workflow.
xg_fit <-
  final_xgb |>
  fit(data = game_train)
We can see some things about that fit, including all the iterations of our XGBoost model. Remember: boosted models work sequentially, one after the other, so you can see it at work. The error goes down with each iteration as we descend the gradient.
xg_fit |>
  extract_fit_parsnip()
parsnip model object
##### xgb.Booster
raw: 292.2 Kb
call:
xgboost::xgb.train(params = list(eta = 0.0633417088937915, max_depth = 8L,
gamma = 17.9327965928095, colsample_bytree = 1, colsample_bynode = 1,
min_child_weight = 33L, subsample = 1), data = x$data, nrounds = 30,
watchlist = x$watchlist, verbose = 0, nthread = 1, objective = "binary:logistic")
params (as set within xgb.train):
eta = "0.0633417088937915", max_depth = "8", gamma = "17.9327965928095", colsample_bytree = "1", colsample_bynode = "1", min_child_weight = "33", subsample = "1", nthread = "1", objective = "binary:logistic", validate_parameters = "TRUE"
xgb.attributes:
niter
callbacks:
cb.evaluation.log()
# of features: 4
niter: 30
nfeatures : 4
evaluation_log:
iter training_logloss
1 0.6866136
2 0.6808034
---
29 0.6308232
30 0.6303655
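That evaluation_log at the bottom is worth a look. One way to see the descent yourself – a sketch, assuming the underlying booster stores the log as shown above – is to pull the raw xgboost object out with extract_fit_engine() and chart the training logloss by iteration:

booster <- extract_fit_engine(xg_fit) # the raw xgb.Booster inside the fit

booster$evaluation_log |>
  as.data.frame() |>
  ggplot(aes(x = iter, y = training_logloss)) +
  geom_line() +
  labs(x = "iteration", y = "training logloss")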
5.3.2 Exercise 5: prediction time
Now, like before, we can bind our predictions using our xg_fit to the game_train data.
trainresults <- game_train |>
  bind_cols(predict(xg_fit, game_train))
5.3.3 Exercise 6: metrics
And now see how we did.
metrics(trainresults, truth = team_result, estimate = .pred_class)
# A tibble: 2 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.646
2 kap binary 0.293
How about the test data?
testresults <- game_test |>
  bind_cols(predict(xg_fit, game_test))
metrics(testresults, truth = team_result, estimate = .pred_class)
# A tibble: 2 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.632
2 kap binary 0.264
Unlike the random forest, there isn’t nearly the same drop in metrics between training and testing. That’s the generalizability of boosting at work.