11  XGBoost for regression

11.1 The basics

And now we return to XGBoost, but this time for regression. Recall that boosting is another wrinkle in tree-based methods. Instead of one deep tree, boosting intentionally builds a sequence of shallow trees – the shallowest, with a single split, are called stumps – that, at least initially, do a poor job of predicting the outcome. Each subsequent tree then picks up where the previous one left off, fitting the residuals – the gap between prediction and reality – to nudge the combined prediction closer. And then the next one does the same, and so on and so on.
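To make that residual-fitting loop concrete, here's a bare-bones sketch in base R on made-up toy data. This is just the core idea – hand-rolled stumps fitting residuals with a learning rate – not the actual XGBoost algorithm, which adds gradients, regularization and much more.

```r
# A toy boosting loop (assumption: illustrative data, hand-rolled stumps)
set.seed(1)
x <- runif(100, 0, 10)
y <- sin(x) + rnorm(100, sd = 0.2)

# A stump: one split on x, predicting the mean of y on each side
fit_stump <- function(x, y) {
  best <- list(sse = Inf)
  for (ct in sort(unique(x))[-1]) {
    left <- y[x < ct]; right <- y[x >= ct]
    sse <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
    if (sse < best$sse) {
      best <- list(sse = sse, cut = ct, left = mean(left), right = mean(right))
    }
  }
  best
}
predict_stump <- function(s, x) ifelse(x < s$cut, s$left, s$right)

learn_rate <- 0.1
pred <- rep(mean(y), length(y))  # start from the overall mean
for (i in 1:50) {
  s <- fit_stump(x, y - pred)    # each stump fits the current residuals
  pred <- pred + learn_rate * predict_stump(s, x)
}

# Training error shrinks as the stumps stack up
mean((y - pred)^2) < mean((y - mean(y))^2)
```

Each stump alone is a terrible model – two flat lines – but fifty of them stacked up, each correcting the last one's mistakes, track the curve surprisingly well.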

So far, our linear regression and random forest methods haven't been that great. Does XGBoost, with its method of optimizing to reduce error, fare any better? Let's try, but this time we're going to add more information: college receiving stats.

As always, we start with libraries.

library(tidyverse)
library(tidymodels)

set.seed(1234)

We’re going to load a new data file this time. It’s called wrdraftedstats, and it contains wide receivers who were drafted, their fantasy stats and their college career stats.

wrdraftedstats <- read_csv("https://mattwaite.github.io/sportsdatafiles/wrdraftedstats20132022.csv")

Let’s narrow that down to just the columns we need. We’re going to add total_yards and total_touchdowns to our data to see what happens to our predictions.

wrselected <- wrdraftedstats %>%
  select(
    name,
    year,
    college_team,
    nfl_team,
    overall,
    total_yards,
    total_touchdowns,
    FantPt
  ) %>% na.omit()

Before we get to the recipe, let’s split our data.

player_split <- initial_split(wrselected, prop = .8)

player_train <- training(player_split)
player_test <- testing(player_split)
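If you're curious what initial_split() is doing under the hood, it's essentially sampling row numbers. Here's a base-R sketch of the same 80/20 idea on a throwaway data frame (the toy data here is made up purely for illustration):

```r
# Base-R sketch of an 80/20 split (assumption: toy data frame,
# just the mechanics that initial_split() handles for us)
set.seed(1234)
toy <- data.frame(id = 1:100, y = rnorm(100))

train_rows <- sample(nrow(toy), size = floor(0.8 * nrow(toy)))
toy_train <- toy[train_rows, ]
toy_test  <- toy[-train_rows, ]

nrow(toy_train)  # 80
nrow(toy_test)   # 20
```

The key property: no row lands in both pieces, so the test set stays unseen until the very end.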

11.2 Implementing XGBoost

Our player recipe will remain unchanged: the . notation means “every column that isn’t the outcome is a predictor,” and update_role() then reassigns the identifying columns to an ID role so they’re kept out of the model.

player_recipe <- 
  recipe(FantPt ~ ., data = player_train) %>%
  update_role(name, year, college_team, nfl_team, new_role = "ID")

summary(player_recipe)
# A tibble: 8 × 4
  variable         type    role      source  
  <chr>            <chr>   <chr>     <chr>   
1 name             nominal ID        original
2 year             numeric ID        original
3 college_team     nominal ID        original
4 nfl_team         nominal ID        original
5 overall          numeric predictor original
6 total_yards      numeric predictor original
7 total_touchdowns numeric predictor original
8 FantPt           numeric outcome   original

Our prediction will use the overall draft pick, total yards in college and total touchdowns in college. Does that predict fantasy points better? Let’s implement multiple models side by side.

linear_mod <- 
  linear_reg() %>% 
  set_engine("lm") %>%
  set_mode("regression")

rf_mod <- 
  rand_forest() %>%
  set_engine("ranger") %>%
  set_mode("regression")

xg_mod <- boost_tree(
  trees = tune(), 
  learn_rate = tune(),
  tree_depth = tune(), 
  min_n = tune(),
  loss_reduction = tune(), 
  sample_size = tune(), 
  mtry = tune()
  ) %>% 
  set_mode("regression") %>% 
  set_engine("xgboost")

Now to create workflows.

linear_workflow <- 
  workflow() %>% 
  add_model(linear_mod) %>% 
  add_recipe(player_recipe)

rf_workflow <- 
  workflow() %>% 
  add_model(rf_mod) %>% 
  add_recipe(player_recipe)

xg_workflow <- 
  workflow() %>% 
  add_model(xg_mod) %>% 
  add_recipe(player_recipe)

Now to tune the XGBoost model.

xgb_grid <- grid_latin_hypercube(
  trees(),
  tree_depth(),
  min_n(),
  loss_reduction(),
  sample_size = sample_prop(),
  finalize(mtry(), player_train),
  learn_rate()
)

player_folds <- vfold_cv(player_train)

xgb_res <- tune_grid(
  xg_workflow,
  resamples = player_folds,
  grid = xgb_grid,
  control = control_grid(save_pred = TRUE)
)

best_rmse <- select_best(xgb_res, metric = "rmse")

final_xgb <- finalize_workflow(
  xg_workflow,
  best_rmse
)

Because there’s not a ton of data here, this goes relatively quickly. Now to create fits.

linear_fit <- 
  linear_workflow %>% 
  fit(data = player_train)

rf_fit <- 
  rf_workflow %>% 
  fit(data = player_train)

xg_fit <- 
  final_xgb %>% 
  fit(data = player_train)

And now to make predictions. Note that we’re predicting on the training data here, so these numbers will flatter every model – the test data is the real check.

linearpredict <- 
  linear_fit %>% 
  predict(new_data = player_train) %>%
  bind_cols(player_train) 

rfpredict <- 
  rf_fit %>% 
  predict(new_data = player_train) %>%
  bind_cols(player_train) 

xgpredict <- 
  xg_fit %>% 
  predict(new_data = player_train) %>%
  bind_cols(player_train) 

For your assignment: Interpret the metrics output of each model and compare them. How does each fare relative to the others?

metrics(linearpredict, FantPt, .pred)
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      43.6  
2 rsq     standard       0.279
3 mae     standard      34.2  
metrics(rfpredict, FantPt, .pred)
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      24.6  
2 rsq     standard       0.839
3 mae     standard      18.5  
metrics(xgpredict, FantPt, .pred)
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      38.1  
2 rsq     standard       0.455
3 mae     standard      28.6  
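If those three numbers feel abstract, each is simple arithmetic on the truth and prediction columns. Here's a hand-rolled version on made-up toy vectors, matching what yardstick's metrics() computes:

```r
# Hand-computed versions of the three metrics (assumption: toy vectors,
# invented purely to show the arithmetic behind metrics())
truth <- c(10, 20, 30, 40)
pred  <- c(12, 18, 33, 37)

rmse <- sqrt(mean((truth - pred)^2))  # penalizes big misses more heavily
mae  <- mean(abs(truth - pred))       # average miss, in the outcome's units
rsq  <- cor(truth, pred)^2            # squared correlation of truth and prediction

round(c(rmse = rmse, mae = mae, rsq = rsq), 3)
#  rmse   mae   rsq 
# 2.550 2.500 0.951 
```

So an rmse of 43.6 means the linear model's fantasy-point guesses miss by roughly 44 points in a typical-squared-error sense, and an rsq of 0.279 means it explains a bit over a quarter of the variation.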

For your assignment: Implement metrics on the test data. What happens?