5  XGBoost

5.1 The basics

As we learned in the previous chapter, random forests (and bagged methods) average together a large number of trees to get to an answer. Random forests add a wrinkle by randomly choosing features at each split so the trees are less correlated with each other, and the trees are grown rather deep. The idea behind averaging them together is to cut down on the variance in predictions – individual deep trees overfit, so on their own they tend to handle unseen data poorly. Random forests are fairly simple to implement, and they are very popular.

Boosting methods are another wrinkle on tree-based methods. Instead of deep trees, boosting methods intentionally use shallow trees – called stumps – that, at least on their own, do a poor job of predicting the outcome. Each subsequent stump picks up where the previous one left off, fitting to the residuals – the gap between prediction and reality – and nudging the combined prediction a little closer. Then the next one does the same, and so on and so on.
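
To make that concrete, here's a toy sketch of the boosting idea. It's not how XGBoost is implemented internally, just the fit-the-residuals loop run on made-up data. It assumes you have the rpart package installed to grow the stumps, and every name in it (toy, boosted_prediction and so on) is just for illustration.

library(rpart)

toy <- data.frame(x = runif(500, 0, 10))
toy$y <- sin(toy$x) + rnorm(500, sd = .3)

learn_rate <- .1
toy$boosted_prediction <- mean(toy$y)  # start from a constant guess

for (i in 1:50) {
  # the residuals are the gap between prediction and reality
  toy$residual <- toy$y - toy$boosted_prediction
  # fit a shallow tree (a stump) to those residuals
  stump <- rpart(residual ~ x, data = toy, control = rpart.control(maxdepth = 1))
  # nudge the running prediction toward reality, scaled by the learning rate
  toy$boosted_prediction <- toy$boosted_prediction + learn_rate * predict(stump, toy)
}

# the error shrinks as stumps get added
mean((toy$y - toy$boosted_prediction)^2)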

The path to a boosted model is more complex, and fitting one can take a lot of your computer's time, but the models tend to be more generalizable, meaning they handle new data better than many other methods. Among data scientists, boosted methods such as XGBoost and LightGBM are very popular for solving a wide variety of problems.

Let’s re-implement our predictions in an XGBoost algorithm. First, we’ll load libraries.

library(tidyverse)
library(tidymodels)
library(zoo)
library(hoopR)

set.seed(1234)

We’ll load our game data and repeat the bit of feature engineering we used with our other models.

teamgames <- load_mbb_team_box(seasons = 2015:2024)

teamstats <- teamgames |> 
  mutate(
    # estimate possessions from the box score, then compute points per
    # possession for the team (ppp) and for its opponent (oppp)
    possessions = field_goals_attempted - offensive_rebounds + turnovers + (.475 * free_throws_attempted),
    ppp = team_score/possessions,
    oppp = opponent_team_score/possessions
  )

rollingteamstats <- teamstats |> 
  group_by(team_short_display_name, season) |>
  arrange(game_date) |>
  mutate(
    # lag by one game so a game's own result isn't in its average, then
    # take a five-game rolling mean of each efficiency metric
    team_rolling_ppp = rollmean(lag(ppp, n=1), k=5, align="right", fill=NA),
    team_rolling_oppp = rollmean(lag(oppp, n=1), k=5, align="right", fill=NA)
    ) |> 
  ungroup()

team_side <- rollingteamstats |>
  select(
    game_id,
    team_id, 
    team_short_display_name, 
    opponent_team_id, 
    game_date, 
    season, 
    team_score, 
    team_rolling_ppp,
    team_rolling_oppp
    )

opponent_side <- team_side |>
  select(-opponent_team_id) |> 
  rename(
    opponent_team_id = team_id,
    opponent_short_display_name = team_short_display_name,
    opponent_score = team_score,
    opponent_rolling_ppp = team_rolling_ppp,
    opponent_rolling_oppp = team_rolling_oppp
  ) |>
  mutate(opponent_id = as.numeric(opponent_team_id))

games <- team_side |> inner_join(opponent_side)

games <- games |> mutate(
  team_result = as.factor(case_when(
    team_score > opponent_score ~ "W",
    opponent_score > team_score ~ "L"
)))

games$team_result <- relevel(games$team_result, ref="W")

modelgames <- games |> 
  select(
    game_id, 
    game_date, 
    team_short_display_name, 
    opponent_short_display_name, 
    season, 
    team_rolling_ppp, 
    team_rolling_oppp,
    opponent_rolling_ppp,
    opponent_rolling_oppp,
    team_result
    ) |>
  na.omit()

Per usual, we split our data into training and testing.

game_split <- initial_split(modelgames, prop = .8)
game_train <- training(game_split)
game_test <- testing(game_split)

And our recipe.

game_recipe <- 
  recipe(team_result ~ ., data = game_train) |> 
  update_role(game_id, game_date, team_short_display_name, opponent_short_display_name, season, new_role = "ID")

summary(game_recipe)
# A tibble: 10 × 4
   variable                    type      role      source  
   <chr>                       <list>    <chr>     <chr>   
 1 game_id                     <chr [2]> ID        original
 2 game_date                   <chr [1]> ID        original
 3 team_short_display_name     <chr [3]> ID        original
 4 opponent_short_display_name <chr [3]> ID        original
 5 season                      <chr [2]> ID        original
 6 team_rolling_ppp            <chr [2]> predictor original
 7 team_rolling_oppp           <chr [2]> predictor original
 8 opponent_rolling_ppp        <chr [2]> predictor original
 9 opponent_rolling_oppp       <chr [2]> predictor original
10 team_result                 <chr [3]> outcome   original

To this point, everything looks like what we’ve done before. Nothing has really changed. It’s about to.

5.2 Hyperparameters

Hyperparameters are the inputs to the algorithm that control how the model fits. To find the ideal hyperparameters, you need to tune them. But first, let’s talk about the hyperparameters we are going to tune (there are others we could tune, but every hyperparameter we add means more computation):

  • mtry – the number of predictors that will be randomly sampled at each split when making trees.
  • Learn rate – this controls how fast the algorithm goes down the gradient descent – how fast it learns. Too fast and you’ll overshoot the optimal stopping point and start going up the error curve. Too slow and you’ll never get to the optimal stopping point.
  • Tree depth – controls the depth of each individual tree. Too short and you’ll need a lot of them to get good results. Too deep and you risk overfitting.
  • Minimum number of observations in the terminal node (min_n) – controls the complexity of each tree. Typical values range from 5 to 15, and higher values keep a model from learning relationships that are unique to that training set (i.e. overfitting).
  • Loss reduction – the minimum improvement in the loss required to make a new split. If a candidate split improves the loss by at least this much, the split happens. A low value gets you a more complex tree; a high value gets you a more conservative tree that is more robust to new data.

Others you can tune, but have sensible defaults:

  • Number of trees – this is the total number of trees in the sequence. A gradient boosting algorithm will minimize residuals forever, so you need to tell it where to stop. That stopping point is different for every problem. You can tune this, but I’ll warn you – this is the most computationally expensive tuning. For our example, we’re going to set the number of trees at 30. The default is 15. Tuning it can take a good computer more than an hour to complete, and you’ll have gained .1 percent of accuracy. If we’re Amazon and billions of dollars are on the line, it’s worth it. For predicting NCAA tournament games, it is not.
  • Sample size – The fraction of the total training set that can be used for each boosting round. Low values may lead to underfitting, high to overfitting.

All of these combine to make the model, and each has its own specific ideal value. How do we find them? Tuning.
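
Before we do, it can help to see the ranges tidymodels considers reasonable for each of these. This is an optional peek, not a required step: the dials package (loaded with tidymodels) defines a parameter object for each hyperparameter, and printing one shows its default range.

tree_depth()
min_n()
loss_reduction()  # expressed on a log10 scale
learn_rate()      # also on a log10 scale
mtry()            # its upper bound is unknown until we give it data; that's what finalize() will do below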

5.2.1 Exercise 1: tuning

First, we make a model and label each parameter we want to tune as tune().

xg_mod <- boost_tree(
  trees = 30,
  mtry = tune(),
  learn_rate = ????(),
  tree_depth = ????(), 
  min_n = ????(),
  loss_reduction = ????()
  ) |> 
  set_mode("classification") |> 
  set_engine("xgboost")

5.2.2 Exercise 2: making a workflow

Let’s make a workflow now that we have our recipe and our model.

game_wflow <- 
  workflow() |> 
  add_model(??_???) |> 
  add_recipe(????_recipe)

Now, to tune the model, we have to create a grid. The grid is essentially a random sample of parameter values to try. A Latin hypercube is a method of creating a near-random sample of parameter values from a multidimensional distribution (i.e. there’s more than one parameter). The Latin hypercube is near-random because there has to be one sample in each row and column of the hypercube. Essentially, it removes the possibility of totally empty spaces in the cube. Why is that important? Because this hypercube is how your tuning is going to find the optimal settings.

Here are the parameters our hypercube grid will sample.

xgb_grid <- grid_latin_hypercube(
  finalize(mtry(), game_train),  # mtry's upper bound depends on the data, so finalize() fills it in
  tree_depth(),
  min_n(),
  loss_reduction(),
  learn_rate(),
  size = 30
)

xgb_grid
# A tibble: 30 × 5
    mtry tree_depth min_n loss_reduction learn_rate
   <int>      <int> <int>          <dbl>      <dbl>
 1     3          6    10        4.14e-3   1.84e- 4
 2     2          9    16        1.27e-4   6.39e- 3
 3    10          2    23        1.02e-5   9.98e-10
 4     5         12    20        4.56e-7   5.62e-10
 5     9          8    33        1.79e+1   6.33e- 2
 6     6          8    15        1.19e-9   2.62e- 3
 7     5          5     7        1.44e-1   7.97e- 5
 8     7         12    28        3.41e-8   4.06e- 5
 9     1         13    25        1.04e+1   1.29e-10
10     4          5     6        1.37e+0   6.46e- 7
# ℹ 20 more rows

How do we tune it? Using something called cross-fold validation. Cross-fold validation splits the training data into subsets – in our case 10 of them – and tries the grid on each. It’ll take a combination of values from the hypercube, fit it, and see what happens. Then it keeps doing that, over and over and over. When it’s done, every combination has been tried on every validation set, and we can compare the results and pick the optimal set.

5.2.3 Exercise 3: creating our cross-fold validation set

This will create the folds, which are just 10 random subsets of the training data.

game_folds <- vfold_cv(game_?????)

game_folds
#  10-fold cross-validation 
# A tibble: 10 × 2
   splits               id    
   <list>               <chr> 
 1 <split [60703/6745]> Fold01
 2 <split [60703/6745]> Fold02
 3 <split [60703/6745]> Fold03
 4 <split [60703/6745]> Fold04
 5 <split [60703/6745]> Fold05
 6 <split [60703/6745]> Fold06
 7 <split [60703/6745]> Fold07
 8 <split [60703/6745]> Fold08
 9 <split [60704/6744]> Fold09
10 <split [60704/6744]> Fold10

Now we come to the part that is going to take some time on your computer. How long? It depends. On my old 2018 Intel Mac, this part took about 25-30 minutes and made it sound like the machine was attempting liftoff. My 2022 M1 Mac? Right around 10 minutes. Depending on how new and how powerful your computer is, it could take minutes, or it could take hours. The point being: start it up, walk away, and let it burn.
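
One optional aside before you kick it off: tune can work the folds in parallel if you set up a parallel backend first. Exactly how depends on your version of tune (newer versions use the future package, older ones use foreach and doParallel), so treat this as a sketch to adapt rather than a guaranteed speedup. Either way, the tuning call itself doesn't change.

# with recent versions of tune, registering a future plan is enough;
# leaving one core free keeps your machine usable in the meantime
library(future)
plan(multisession, workers = parallel::detectCores() - 1)

With or without that, here's the run: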

xgb_res <- tune_grid(
  game_wflow,
  resamples = game_folds,
  grid = xgb_grid,
  control = control_grid(save_pred = TRUE)
)

Our grid has run on all of our validation samples, and what do we see?

collect_metrics(xgb_res)
# A tibble: 60 × 11
    mtry min_n tree_depth learn_rate loss_reduction .metric  .estimator  mean
   <int> <int>      <int>      <dbl>          <dbl> <chr>    <chr>      <dbl>
 1     3    10          6   1.84e- 4    0.00414     accuracy binary     0.628
 2     3    10          6   1.84e- 4    0.00414     roc_auc  binary     0.680
 3     2    16          9   6.39e- 3    0.000127    accuracy binary     0.633
 4     2    16          9   6.39e- 3    0.000127    roc_auc  binary     0.687
 5    10    23          2   9.98e-10    0.0000102   accuracy binary     0.500
 6    10    23          2   9.98e-10    0.0000102   roc_auc  binary     0.5  
 7     5    20         12   5.62e-10    0.000000456 accuracy binary     0.500
 8     5    20         12   5.62e-10    0.000000456 roc_auc  binary     0.5  
 9     9    33          8   6.33e- 2   17.9         accuracy binary     0.636
10     9    33          8   6.33e- 2   17.9         roc_auc  binary     0.689
# ℹ 50 more rows
# ℹ 3 more variables: n <int>, std_err <dbl>, .config <chr>

Well, we see 60 rows – our 30 combinations, each with an accuracy and a roc_auc. But that doesn’t mean much to us just eyeballing it. We want to see the best combination. There’s a function to show us just the best ones called … wait for it … show_best.

show_best(xgb_res, "accuracy")
# A tibble: 5 × 11
   mtry min_n tree_depth learn_rate loss_reduction .metric  .estimator  mean
  <int> <int>      <int>      <dbl>          <dbl> <chr>    <chr>      <dbl>
1     9    33          8  0.0633     17.9          accuracy binary     0.636
2     1    34         11  0.0170      0.0000000846 accuracy binary     0.635
3     2    16          9  0.00639     0.000127     accuracy binary     0.633
4     2    30         11  0.0276      0.00000530   accuracy binary     0.633
5     2    20          7  0.0000202   0.0372       accuracy binary     0.632
# ℹ 3 more variables: n <int>, std_err <dbl>, .config <chr>

The best combination as of this data update comes up with an accuracy of about 63.6 percent.
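
If a table of 60 rows isn't doing much for you, tune also provides an autoplot() method for tuning results that charts each hyperparameter against the metrics. It's an optional extra, but it's a quick way to see which settings mattered.

autoplot(xgb_res, metric = "accuracy")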

5.3 Finalizing our model

Let’s capture our best set of hyperparameters so we can use them in our model.

best_acc <- select_best(xgb_res, "accuracy")

And now we put that into a final workflow. Pay attention to the main arguments in the output below.

final_xgb <- finalize_workflow(
  game_wflow,
  best_acc
)

final_xgb
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: boost_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
0 Recipe Steps

── Model ───────────────────────────────────────────────────────────────────────
Boosted Tree Model Specification (classification)

Main Arguments:
  mtry = 9
  trees = 30
  min_n = 33
  tree_depth = 8
  learn_rate = 0.0633417088937915
  loss_reduction = 17.9327965928095

Computational engine: xgboost 

There’s our best set of hyperparameters. We’ve tuned this model to give the best possible results with those settings. Now we apply it like we have been doing all along.

5.3.1 Exercise 4: making our final workflow

We create a fit using our finalized workflow.

xg_fit <- 
  ?????_xgb |> 
  fit(data = game_train)

We can see some things about that fit, including all the iterations of our XGBoost model. Remember: boosted models work sequentially, one tree after the other, so you can see them at work. The error goes down with each iteration as we move down the gradient.

xg_fit |> 
  extract_fit_parsnip() 
parsnip model object

##### xgb.Booster
raw: 292.2 Kb 
call:
  xgboost::xgb.train(params = list(eta = 0.0633417088937915, max_depth = 8L, 
    gamma = 17.9327965928095, colsample_bytree = 1, colsample_bynode = 1, 
    min_child_weight = 33L, subsample = 1), data = x$data, nrounds = 30, 
    watchlist = x$watchlist, verbose = 0, nthread = 1, objective = "binary:logistic")
params (as set within xgb.train):
  eta = "0.0633417088937915", max_depth = "8", gamma = "17.9327965928095", colsample_bytree = "1", colsample_bynode = "1", min_child_weight = "33", subsample = "1", nthread = "1", objective = "binary:logistic", validate_parameters = "TRUE"
xgb.attributes:
  niter
callbacks:
  cb.evaluation.log()
# of features: 4 
niter: 30
nfeatures : 4 
evaluation_log:
    iter training_logloss
       1        0.6866136
       2        0.6808034
---                      
      29        0.6308232
      30        0.6303655
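
If you're curious which of our four predictors do the most work, the vip package, which you'd need to install separately since it isn't part of tidymodels, can pull importance scores out of the fitted booster and plot them.

library(vip)

xg_fit |> 
  extract_fit_parsnip() |> 
  vip(geom = "point")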

5.3.2 Exercise 5: prediction time

Now, like before, we can use our xg_fit to make predictions and bind them to the game_train data.

trainresults <- game_train |>
  bind_cols(predict(??_???, game_train))

5.3.3 Exercise 6: metrics

And now see how we did.

metrics(????????????, truth = team_result, estimate = .pred_class)
# A tibble: 2 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.646
2 kap      binary         0.293

How about the test data?

testresults <- game_test |>
  bind_cols(predict(xg_fit, game_test))

metrics(testresults, truth = team_result, estimate = .pred_class)
# A tibble: 2 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.632
2 kap      binary         0.264

Unlike the random forest, there’s not nearly as big a drop in metrics between training and testing.
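
One last optional check: accuracy and kappa only use the predicted class. Since roc_auc showed up in our tuning metrics too, we can bind in the class probabilities and compute it on the test set with yardstick. The .pred_W column name comes from the class levels; W is the event level because we releveled team_result earlier.

probresults <- game_test |>
  bind_cols(predict(xg_fit, game_test)) |>
  bind_cols(predict(xg_fit, game_test, type = "prob"))

roc_auc(probresults, truth = team_result, .pred_W)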