7 Support Vector Machines
The last method of statistical learning that we’ll use is the Support Vector Machine. The key idea behind the support vector machine is the hyperplane. First, think of a hyperplane as a line: a boundary separating wins and losses along an axis, where a certain stat says this way is a win and that way is a loss. Each dot in the scatter plot is a game; the dots closest to the boundary are the support vectors, and they alone determine where the hyperplane sits. The hyperplane separates the support vectors into their classes: it’s the line between winning and losing. When you have just two stats like that, it’s pretty easy to picture. Where support vector machines get hard is when you have many predictors defining the hyperplane. Then, instead of a line, it becomes a multidimensional shape in a multidimensional space.
The further away a game sits from our hyperplane, the more confident we are in the prediction.
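To make that concrete, here’s a minimal sketch on a tiny invented dataset; the numbers and the ppp/oppp values are made up for illustration, not drawn from our real data. With two stats and kernlab’s linear kernel, the fitted hyperplane is a straight line, and the decision value predict() returns grows with distance from it.

library(kernlab)

# six invented games: winners score efficiently, losers don't
toy <- data.frame(
  ppp    = c(1.10, 1.05, 1.12, 0.90, 0.88, 0.95),
  oppp   = c(0.90, 0.95, 0.88, 1.10, 1.12, 1.05),
  result = factor(c("W", "W", "W", "L", "L", "L"))
)

# vanilladot is kernlab's linear kernel, so the hyperplane stays a line
toy_svm <- ksvm(result ~ ppp + oppp, data = toy, kernel = "vanilladot")

# the sign of the decision value picks the class; its magnitude grows
# with distance from the hyperplane, which is our confidence
predict(toy_svm, toy, type = "decision")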
We’re going to implement a support vector machine alongside a logistic regression and a LightGBM model.
Let’s start with our libraries, per usual.

library(tidyverse)
library(tidymodels)
library(zoo)
library(bonsai)
library(hoopR)

set.seed(1234)
We’ll continue to use what we’ve done for feature engineering – a rolling window of points per possession for team and opponent. You should be quite familiar with this by now.
teamgames <- load_mbb_team_box(seasons = 2015:2024)

teamstats <- teamgames |>
  mutate(
    # estimate possessions, then points per possession for and against
    possessions = field_goals_attempted - offensive_rebounds + turnovers + (.475 * free_throws_attempted),
    ppp = team_score/possessions,
    oppp = opponent_team_score/possessions
  )

rollingteamstats <- teamstats |>
  group_by(team_short_display_name, season) |>
  arrange(game_date) |>
  mutate(
    # lag by one game so a game never includes itself in its own average
    team_rolling_ppp = rollmean(lag(ppp, n=1), k=5, align="right", fill=NA),
    team_rolling_oppp = rollmean(lag(oppp, n=1), k=5, align="right", fill=NA)
  ) |>
  ungroup()

team_side <- rollingteamstats |>
  select(
    game_id,
    team_id,
    team_short_display_name,
    opponent_team_id,
    game_date,
    season,
    team_score,
    team_rolling_ppp,
    team_rolling_oppp
  )

opponent_side <- team_side |>
  select(-opponent_team_id) |>
  rename(
    opponent_team_id = team_id,
    opponent_short_display_name = team_short_display_name,
    opponent_score = team_score,
    opponent_rolling_ppp = team_rolling_ppp,
    opponent_rolling_oppp = team_rolling_oppp
  ) |>
  mutate(opponent_id = as.numeric(opponent_team_id))

games <- team_side |> inner_join(opponent_side)

games <- games |> mutate(
  team_result = as.factor(case_when(
    team_score > opponent_score ~ "W",
    opponent_score > team_score ~ "L"
  )))

games$team_result <- relevel(games$team_result, ref="W")

modelgames <- games |>
  select(
    game_id,
    game_date,
    team_short_display_name,
    opponent_short_display_name,
    season,
    team_rolling_ppp,
    team_rolling_oppp,
    opponent_rolling_ppp,
    opponent_rolling_oppp,
    team_result
  ) |>
  na.omit()
We’re going to go through the steps of modeling again, starting with splitting our modelgames data.
7.0.1 Exercise 1: setting up your data
game_split <- initial_split(modelgames, prop = .8)
game_train <- training(????_?????)
game_test <- testing(????_?????)
The recipe we’ll create is the same for both, so we’ll use it three times.
7.0.2 Exercise 2: setting up the recipe
So what data are we feeding into our recipe?
game_recipe <-
  recipe(????_?????? ~ ., data = game_train) %>%
  update_role(game_id, game_date, team_short_display_name, opponent_short_display_name, season, new_role = "ID") %>%
  step_normalize(all_predictors())
summary(game_recipe)
# A tibble: 10 × 4
variable type role source
<chr> <list> <chr> <chr>
1 game_id <chr [2]> ID original
2 game_date <chr [1]> ID original
3 team_short_display_name <chr [3]> ID original
4 opponent_short_display_name <chr [3]> ID original
5 season <chr [2]> ID original
6 team_rolling_ppp <chr [2]> predictor original
7 team_rolling_oppp <chr [2]> predictor original
8 opponent_rolling_ppp <chr [2]> predictor original
9 opponent_rolling_oppp <chr [2]> predictor original
10 team_result <chr [3]> outcome original
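If you want to see what step_normalize actually does, you can prep the recipe and bake the training data once the outcome blank above is filled in. This is purely an inspection step, not required for the workflows below; the predictors should come out with a mean near zero and a standard deviation near one.

# prep() estimates the normalization from the training data;
# bake(new_data = NULL) returns the transformed training set
game_recipe |>
  prep() |>
  bake(new_data = NULL) |>
  summarize(mean = mean(team_rolling_ppp), sd = sd(team_rolling_ppp))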
Now, we’re going to create three different model specifications. The first will be the logistic regression model definition, the second the LightGBM, and the third the SVM.
log_mod <-
  logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

lightgbm_mod <-
  boost_tree() %>%
  set_engine("lightgbm") %>%
  set_mode(mode = "classification")

svm_mod <-
  svm_poly() %>%
  set_engine("kernlab") %>%
  set_mode("classification")
Now we have enough for our workflows. We have three models and one recipe.
7.0.3 Exercise 3: making workflows
log_workflow <-
  workflow() %>%
  add_model(???_mod) %>%
  add_recipe(????_recipe)

lightgbm_workflow <-
  workflow() %>%
  add_model(light???_mod) %>%
  add_recipe(game_recipe)

svm_workflow <-
  workflow() %>%
  add_model(svm_mod) %>%
  add_recipe(????_??????)
Now we can fit our models to the data.
7.0.4 Exercise 4: fitting our models
log_fit <-
  log_workflow %>%
  fit(data = ????_?????)

lightgbm_fit <-
  lightgbm_workflow %>%
  fit(data = ????_?????)

svm_fit <-
  svm_workflow %>%
  fit(data = ????_?????)
Setting default kernel parameters
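That last line isn’t an error; it’s kernlab noting that it fit the polynomial kernel with its default settings. Before we predict, one optional detour: you can peek inside any fitted workflow. For example, extract_fit_parsnip() pulls out the underlying model, and tidy() turns the logistic regression’s coefficients into a table (this assumes you filled in the fit blanks above).

# coefficients of the fitted logistic regression, one row per predictor
log_fit %>%
  extract_fit_parsnip() %>%
  tidy()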
7.1 Prediction time
Now we can bind our predictions to the training data and see how we did.
logpredict <- log_fit %>%
  predict(new_data = game_train) %>%
  bind_cols(game_train)

logpredict <- log_fit %>%
  predict(new_data = game_train, type="prob") %>%
  bind_cols(logpredict)

lightgbmpredict <- lightgbm_fit %>%
  predict(new_data = game_train) %>%
  bind_cols(game_train)

lightgbmpredict <- lightgbm_fit %>%
  predict(new_data = game_train, type="prob") %>%
  bind_cols(lightgbmpredict)

svmpredict <- svm_fit %>%
  predict(new_data = game_train) %>%
  bind_cols(game_train)

svmpredict <- svm_fit %>%
  predict(new_data = game_train, type="prob") %>%
  bind_cols(svmpredict)
Now, how did we do?
7.1.1 Exercise 5: The first metrics
What prediction dataset do we feed into our metrics? Let’s look first at the LightGBM.
metrics(?????????, team_result, .pred_class)
# A tibble: 2 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.655
2 kap binary 0.310
And now the SVM.
7.1.2 Exercise 6: SVM metrics
metrics(???predict, team_result, .pred_class)
# A tibble: 2 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.639
2 kap binary 0.278
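Because we also predicted probabilities, we can grade the models on more than accuracy. One example: yardstick’s roc_auc scores how well the .pred_W probabilities rank wins above losses. We releveled team_result so W is the first level, which is yardstick’s default event, so this works as-is; here it’s shown on the logistic regression’s training predictions.

# ROC AUC on training data; .5 is coin-flipping, 1 is perfect
roc_auc(logpredict, truth = team_result, .pred_W)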
Looks like the LightGBM did a little better than the SVM on training. But remember: Where a model makes its money is in data that it has never seen before.
First, we look at the LightGBM.
lightgbmtestpredict <- lightgbm_fit %>%
  predict(new_data = game_test) %>%
  bind_cols(game_test)

lightgbmtestpredict <- lightgbm_fit %>%
  predict(new_data = game_test, type="prob") %>%
  bind_cols(lightgbmtestpredict)
metrics(lightgbmtestpredict, team_result, .pred_class)
# A tibble: 2 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.633
2 kap binary 0.267
Right at 63 percent.
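Accuracy collapses everything into one number. If you want to see where the misses land, yardstick’s conf_mat cross-tabulates predicted against actual results.

# rows are predictions, columns are the true results
conf_mat(lightgbmtestpredict, truth = team_result, estimate = .pred_class)

And now the SVM.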
svmtestpredict <- svm_fit %>%
  predict(new_data = game_test) %>%
  bind_cols(game_test)

svmtestpredict <- svm_fit %>%
  predict(new_data = game_test, type="prob") %>%
  bind_cols(svmtestpredict)
metrics(svmtestpredict, team_result, .pred_class)
# A tibble: 2 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.636
2 kap binary 0.272
Slightly better, if only very slightly. More telling is the drop-off: the LightGBM fell from .655 accuracy in training to .633 in testing, while the SVM barely moved, going from .639 to .636. That suggests the SVM is a bit more robust to new data than the LightGBM.
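If you want the head-to-head in one table, you can stack the two sets of test metrics; this just reuses the prediction frames we built above.

# one row per metric per model, for easy side-by-side reading
bind_rows(
  metrics(lightgbmtestpredict, team_result, .pred_class) %>% mutate(model = "LightGBM"),
  metrics(svmtestpredict, team_result, .pred_class) %>% mutate(model = "SVM")
)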