```r
library(tidyverse)
library(tidymodels)
set.seed(1234)
```

11 XGBoost for regression

11.1 The basics
And now we return to XGBoost, but this time for regression. Recall that boosting methods are another wrinkle in tree-based methods. Instead of deep trees, boosting intentionally uses shallow trees – called stumps – that, at least initially, do a poor job of predicting the outcome. Each subsequent stump takes the previous one's predictions, fits to the residuals – the gap between prediction and reality – and adds its own correction. Then the next one does the same, and so on and so on.
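To make that residual-chasing idea concrete, here is a toy sketch in base R. It is not XGBoost – the "stump" is just the mean of the current residuals, and the data is made up – but it shows how each round chips away at the leftover error:

```r
# Toy sketch of the boosting idea (hypothetical data, not the chapter's).
# Each "stump" here is just the mean of the current residuals, and each
# round shrinks the correction by a learning rate before adding it in.
y <- c(10, 12, 30, 35, 50)   # made-up outcome values
learn_rate <- 0.5
pred <- rep(0, length(y))    # start with a deliberately poor prediction

for (i in 1:20) {
  residuals <- y - pred              # the gap between prediction and reality
  stump <- mean(residuals)           # a trivially simple "weak learner"
  pred <- pred + learn_rate * stump  # each round reduces the error
}

round(pred, 2)  # converges toward mean(y) = 27.4
```

A real boosting model replaces that constant with a small tree at each step, so different rows get different corrections – but the loop structure is the same.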
So far, our linear regression and random forest methods aren't that great. Does XGBoost, with its method of optimizing to reduce error, fare any better? Let's try, but this time we're going to add more information. We're going to add college receiving stats.
As always, we start with libraries.
We’re going to load a new data file this time. It’s called wrdraftedstats and it now has wide receivers who were drafted, their fantasy stats, and their college career stats.
```r
wrdraftedstats <- read_csv("https://mattwaite.github.io/sportsdatafiles/wrdraftedstats20132022.csv")
```

Let's narrow that down to just the columns we need. We're going to add total_yards and total_touchdowns to our data to see what happens to our predictions.
```r
wrselected <- wrdraftedstats %>%
  select(
    name,
    year,
    college_team,
    nfl_team,
    overall,
    total_yards,
    total_touchdowns,
    FantPt
  ) %>%
  na.omit()
```

Before we get to the recipe, let's split our data.
```r
player_split <- initial_split(wrselected, prop = .8)
player_train <- training(player_split)
player_test <- testing(player_split)
```

11.2 Implementing XGBoost
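If you want to confirm the split did what you asked, a quick optional check (not part of the original walkthrough) is to compare row counts:

```r
# Optional sanity check: with prop = .8, about 80 percent of rows should
# land in training and the remaining 20 percent in testing.
nrow(player_train) / nrow(wrselected)
nrow(player_test) / nrow(wrselected)
```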
Our player recipe will remain unchanged because we're using the . notation to mean "every column that isn't the outcome is a predictor," and then update_role() demotes the identifying columns to IDs.
```r
player_recipe <-
  recipe(FantPt ~ ., data = player_train) %>%
  update_role(name, year, college_team, nfl_team, new_role = "ID")

summary(player_recipe)
```

```
# A tibble: 8 × 4
  variable         type    role      source
  <chr>            <chr>   <chr>     <chr>
1 name             nominal ID        original
2 year             numeric ID        original
3 college_team     nominal ID        original
4 nfl_team         nominal ID        original
5 overall          numeric predictor original
6 total_yards      numeric predictor original
7 total_touchdowns numeric predictor original
8 FantPt           numeric outcome   original
```
Our prediction will use the overall draft pick, their total yards in college, and their total touchdowns in college. Does that predict fantasy points better? Let's implement multiple models side by side.
```r
linear_mod <-
  linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

rf_mod <-
  rand_forest() %>%
  set_engine("ranger") %>%
  set_mode("regression")

xg_mod <- boost_tree(
  trees = tune(),
  learn_rate = tune(),
  tree_depth = tune(),
  min_n = tune(),
  loss_reduction = tune(),
  sample_size = tune(),
  mtry = tune()
) %>%
  set_mode("regression") %>%
  set_engine("xgboost")
```

Now to create workflows.
```r
linear_workflow <-
  workflow() %>%
  add_model(linear_mod) %>%
  add_recipe(player_recipe)

rf_workflow <-
  workflow() %>%
  add_model(rf_mod) %>%
  add_recipe(player_recipe)

xg_workflow <-
  workflow() %>%
  add_model(xg_mod) %>%
  add_recipe(player_recipe)
```

Now to tune the XGBoost model.
```r
xgb_grid <- grid_latin_hypercube(
  trees(),
  tree_depth(),
  min_n(),
  loss_reduction(),
  sample_size = sample_prop(),
  finalize(mtry(), player_train),
  learn_rate()
)

player_folds <- vfold_cv(player_train)

xgb_res <- tune_grid(
  xg_workflow,
  resamples = player_folds,
  grid = xgb_grid,
  control = control_grid(save_pred = TRUE)
)

best_rmse <- select_best(xgb_res, metric = "rmse")

final_xgb <- finalize_workflow(
  xg_workflow,
  best_rmse
)
```

Because there's not a ton of data here, this goes relatively quickly. Now to create fits.
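If you're curious how the tuning went before committing to the finalized workflow, tune keeps the resampled metrics around. This is an optional aside, not part of the chapter's pipeline:

```r
# Optional: inspect the tuning results. show_best() lists the top parameter
# combinations by a metric; collect_metrics() returns all of them as a
# tibble you can filter or plot.
show_best(xgb_res, metric = "rmse", n = 5)
collect_metrics(xgb_res)
```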
```r
linear_fit <-
  linear_workflow %>%
  fit(data = player_train)

rf_fit <-
  rf_workflow %>%
  fit(data = player_train)

xg_fit <-
  final_xgb %>%
  fit(data = player_train)
```

And now to make predictions.
```r
linearpredict <-
  linear_fit %>%
  predict(new_data = player_train) %>%
  bind_cols(player_train)

rfpredict <-
  rf_fit %>%
  predict(new_data = player_train) %>%
  bind_cols(player_train)

xgpredict <-
  xg_fit %>%
  predict(new_data = player_train) %>%
  bind_cols(player_train)
```

For your assignment: Interpret the metrics output of each model. Compare them. How does each model do relative to the others?
```r
metrics(linearpredict, FantPt, .pred)
```

```
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      43.6
2 rsq     standard       0.279
3 mae     standard      34.2
```

```r
metrics(rfpredict, FantPt, .pred)
```

```
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      24.6
2 rsq     standard       0.839
3 mae     standard      18.5
```

```r
metrics(xgpredict, FantPt, .pred)
```

```
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      38.1
2 rsq     standard       0.455
3 mae     standard      28.6
```
For your assignment: Implement metrics for test. What happens?
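A sketch of one way to approach it, shown for the XGBoost model only – predict on the held-out test set instead of the training set, then compute the same metrics (your numbers will differ from the training-set tables above, and the drop is usually largest for the most flexible models):

```r
# Sketch: repeat the prediction step on player_test instead of player_train,
# then run the same metrics() call on the result.
xgtestpredict <-
  xg_fit %>%
  predict(new_data = player_test) %>%
  bind_cols(player_test)

metrics(xgtestpredict, FantPt, .pred)
```

The same pattern works for linear_fit and rf_fit.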