library(tidyverse)
library(tidymodels)
set.seed(1234)
11 XGBoost for regression
11.1 The basics
And now we return to XGBoost, but this time for regression. Recall that boosting methods are another wrinkle in tree-based methods. Instead of deep trees, boosting methods intentionally pick shallow trees – called stumps – that, at least initially, do a poor job of predicting the outcome. Then each subsequent stump takes the predictions the previous one made, optimizes to reduce the residuals – the gap between prediction and reality – and makes a new prediction. And then the next one does the same, and so on and so on.
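That residual-chasing loop can be sketched in a few lines of base R. This is a hypothetical toy – one numeric predictor and a hand-rolled stump that picks the best single split – not what xgboost actually does internally, but it shows each stump fitting the previous round’s residuals:

```r
set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.2)

# A "stump": one split on x, predicting the mean residual on each side
fit_stump <- function(x, r) {
  cuts <- quantile(x, probs = seq(0.1, 0.9, 0.1))
  sse <- sapply(cuts, function(cut) {
    sum((r[x <= cut] - mean(r[x <= cut]))^2) +
      sum((r[x > cut] - mean(r[x > cut]))^2)
  })
  best <- cuts[which.min(sse)]
  left <- mean(r[x <= best])
  right <- mean(r[x > best])
  function(newx) ifelse(newx <= best, left, right)
}

pred <- rep(0, length(y))
learn_rate <- 0.3
for (i in 1:50) {
  residuals <- y - pred             # the gap between prediction and reality
  stump <- fit_stump(x, residuals)  # fit a shallow tree to the residuals
  pred <- pred + learn_rate * stump(x)
}

sqrt(mean((y - pred)^2))  # should land well below sd(y), the error of just predicting the mean
```

Fifty weak stumps, each nudging the prediction a little closer to reality, end up approximating the sine curve far better than any one of them could alone.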
So far, our linear regression and random forest methods aren’t that great. Does XGBoost, with its method of optimizing to reduce error, fare any better? Let’s try, but this time we’re going to add more information. We’re going to add college receiving stats.
As always, we start with libraries.
We’re going to load a new data file this time. It’s called wrdraftedstats, and it has wide receivers who were drafted, their fantasy stats, and their college career stats.
wrdraftedstats <- read_csv("https://mattwaite.github.io/sportsdatafiles/wrdraftedstats20132022.csv")
Let’s narrow that down to just our columns we need. We’re going to add total_yards and total_touchdowns to our data to see what happens to our predictions.
wrselected <- wrdraftedstats %>%
  select(
    name,
    year,
    college_team,
    nfl_team,
    overall,
    total_yards,
    total_touchdowns,
    FantPt) %>%
  na.omit()
Before we get to the recipe, let’s split our data.
player_split <- initial_split(wrselected, prop = .8)

player_train <- training(player_split)
player_test <- testing(player_split)
11.2 Implementing XGBoost
Our player recipe will remain unchanged because we’re using the . notation in the formula to mean “everything that isn’t the outcome,” and then using update_role to mark the columns that are IDs rather than predictors.
player_recipe <-
  recipe(FantPt ~ ., data = player_train) %>%
  update_role(name, year, college_team, nfl_team, new_role = "ID")

summary(player_recipe)
# A tibble: 8 × 4
variable type role source
<chr> <chr> <chr> <chr>
1 name nominal ID original
2 year numeric ID original
3 college_team nominal ID original
4 nfl_team nominal ID original
5 overall numeric predictor original
6 total_yards numeric predictor original
7 total_touchdowns numeric predictor original
8 FantPt numeric outcome original
Our prediction will use the overall draft pick, total yards in college and total touchdowns in college. Does that predict fantasy points better? Let’s implement multiple models side by side.
linear_mod <-
  linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

rf_mod <-
  rand_forest() %>%
  set_engine("ranger") %>%
  set_mode("regression")
xg_mod <- boost_tree(
  trees = tune(),
  learn_rate = tune(),
  tree_depth = tune(),
  min_n = tune(),
  loss_reduction = tune(),
  sample_size = tune(),
  mtry = tune()
) %>%
  set_mode("regression") %>%
  set_engine("xgboost")
Now to create workflows.
linear_workflow <-
  workflow() %>%
  add_model(linear_mod) %>%
  add_recipe(player_recipe)

rf_workflow <-
  workflow() %>%
  add_model(rf_mod) %>%
  add_recipe(player_recipe)

xg_workflow <-
  workflow() %>%
  add_model(xg_mod) %>%
  add_recipe(player_recipe)
Now to tune the XGBoost model.
xgb_grid <- grid_latin_hypercube(
  trees(),
  tree_depth(),
  min_n(),
  loss_reduction(),
  sample_size = sample_prop(),
  finalize(mtry(), player_train),
  learn_rate()
)

player_folds <- vfold_cv(player_train)
xgb_res <- tune_grid(
  xg_workflow,
  resamples = player_folds,
  grid = xgb_grid,
  control = control_grid(save_pred = TRUE)
)
best_rmse <- select_best(xgb_res, metric = "rmse")
final_xgb <- finalize_workflow(
  xg_workflow,
  best_rmse
)
Because there’s not a ton of data here, this goes relatively quickly. Now to create fits.
linear_fit <-
  linear_workflow %>%
  fit(data = player_train)

rf_fit <-
  rf_workflow %>%
  fit(data = player_train)

xg_fit <-
  final_xgb %>%
  fit(data = player_train)
And now to make predictions.
linearpredict <-
  linear_fit %>%
  predict(new_data = player_train) %>%
  bind_cols(player_train)

rfpredict <-
  rf_fit %>%
  predict(new_data = player_train) %>%
  bind_cols(player_train)

xgpredict <-
  xg_fit %>%
  predict(new_data = player_train) %>%
  bind_cols(player_train)
For your assignment: Interpret the metrics output of each model. Compare them. How does each model do relative to the others?
metrics(linearpredict, FantPt, .pred)
# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 43.6
2 rsq standard 0.279
3 mae standard 34.2
metrics(rfpredict, FantPt, .pred)
# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 24.6
2 rsq standard 0.839
3 mae standard 18.5
metrics(xgpredict, FantPt, .pred)
# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 38.1
2 rsq standard 0.455
3 mae standard 28.6
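Before comparing models, it helps to know what these three numbers actually are. On a pair of toy vectors (hypothetical values, not the model output above), each metric reduces to a line of base R:

```r
obs  <- c(100, 150, 80, 120, 200)   # hypothetical observed fantasy points
pred <- c(110, 140, 90, 100, 180)   # hypothetical predictions

rmse <- sqrt(mean((obs - pred)^2))  # root mean squared error: penalizes big misses hardest
mae  <- mean(abs(obs - pred))       # mean absolute error: the typical size of a miss
rsq  <- cor(obs, pred)^2            # R-squared: squared correlation of observed and predicted

rmse  # ~14.8
mae   # 14
rsq   # ~0.93
```

Lower is better for rmse and mae, higher is better for rsq – which is why, on these training numbers, the random forest looks strongest.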
For your assignment: Implement metrics for the test data. What happens?
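One thing to watch for: all three metrics() calls above were run on player_train, the same data the models were fit to, and training-set numbers flatter flexible models. A minimal base-R illustration of the gap to expect, using a deliberately over-flexible polynomial on simulated data (nothing here comes from the chapter’s dataset):

```r
set.seed(2)
n <- 60
x <- runif(n, 0, 10)
y <- 3 * x + rnorm(n, sd = 5)  # the true relationship is just a line
df <- data.frame(x, y)
train <- df[1:40, ]
test  <- df[41:60, ]

# A deliberately flexible model: a degree-15 polynomial fit to 40 points
fit <- lm(y ~ poly(x, 15), data = train)

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
rmse(train$y, predict(fit, newdata = train))  # flattering: scored on the data it memorized
rmse(test$y,  predict(fit, newdata = test))   # typically worse on held-out data
```

The same logic applies here: a model that posts a sparkling rmse on player_train may look very different on player_test.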