7  Support Vector Machines

The last method of statistical learning that we’ll use is the support vector machine. The key concept to understand is the hyperplane. First, think of a hyperplane as a line: a boundary separating wins from losses along an axis, where a certain stat says this side is a win and that side is a loss. Each dot in the scatterplot is a vector; the support vectors are the dots closest to the boundary, the ones that define, or support, where the hyperplane sits. The hyperplane separates the vectors into their classes. It’s the line between winning and losing. When you have just two stats like that, it’s pretty easy to picture. Where support vector machines get hard is when you have many predictors. Then, instead of a line, the hyperplane becomes a multidimensional shape in a multidimensional space.

The farther a point sits from the hyperplane, the more confident we are in its predicted class.
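To make that concrete, here’s a minimal sketch on simulated data (the column names echo our stats, but the numbers are made up): a linear SVM in two dimensions using kernlab, the same engine we’ll hand to tidymodels below. The sign of the decision value picks the class; its magnitude grows with distance from the hyperplane.

library(kernlab)

# simulated, purely illustrative data: winners score more per possession
# and give up less
set.seed(1234)
toy <- data.frame(
  ppp = c(rnorm(50, mean = 1.05, sd = .05), rnorm(50, mean = .95, sd = .05)),
  oppp = c(rnorm(50, mean = .95, sd = .05), rnorm(50, mean = 1.05, sd = .05)),
  result = factor(rep(c("W", "L"), each = 50))
)

# vanilladot is kernlab's linear kernel, so the hyperplane here is a line
toy_svm <- ksvm(result ~ ppp + oppp, data = toy, kernel = "vanilladot")

# sign = predicted class, magnitude = distance from the line
head(predict(toy_svm, toy, type = "decision"))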

We’re going to implement a support vector machine alongside a logistic regression and a lightGBM model.

Let’s start with our libraries, per usual.

library(tidyverse)
library(tidymodels)
library(zoo)
library(bonsai)
library(hoopR)

set.seed(1234)

We’ll continue to use what we’ve done for feature engineering – a rolling window of points per possession for team and opponent. You should be quite familiar with this by now.

teamgames <- load_mbb_team_box(seasons = 2015:2024)

teamstats <- teamgames |> 
  mutate(
    # standard possessions estimate: shots and turnovers end possessions,
    # offensive rebounds extend them, and .475 discounts free throw trips
    possessions = field_goals_attempted - offensive_rebounds + turnovers + (.475 * free_throws_attempted),
    ppp = team_score/possessions,            # points scored per possession
    oppp = opponent_team_score/possessions   # points allowed per possession
  )
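As a quick sanity check on that formula, with made-up box score numbers: 60 field goal attempts, 10 offensive rebounds, 12 turnovers and 20 free throw attempts work out to 71.5 possessions.

60 - 10 + 12 + (.475 * 20)
[1] 71.5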

rollingteamstats <- teamstats |> 
  group_by(team_short_display_name, season) |>
  arrange(game_date) |>
  mutate(
    # lag first so a game never counts toward its own prediction,
    # then average the five most recent games
    team_rolling_ppp = rollmean(lag(ppp, n=1), k=5, align="right", fill=NA),
    team_rolling_oppp = rollmean(lag(oppp, n=1), k=5, align="right", fill=NA)
    ) |> 
  ungroup()
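If the lag-then-rollmean combination feels abstract, here’s what it does to a made-up vector: each position gets the mean of the five values before it, and the current value never counts toward its own average. The leading NAs are why early-season games drop out when we na.omit later.

example <- c(1.0, 1.1, 0.9, 1.2, 1.0, 1.3, 0.8)
rollmean(lag(example, n=1), k=5, align="right", fill=NA)
[1]   NA   NA   NA   NA   NA 1.04 1.10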

team_side <- rollingteamstats |>
  select(
    game_id,
    team_id, 
    team_short_display_name, 
    opponent_team_id, 
    game_date, 
    season, 
    team_score, 
    team_rolling_ppp,
    team_rolling_oppp
    )

opponent_side <- team_side |>
  select(-opponent_team_id) |> 
  rename(
    opponent_team_id = team_id,
    opponent_short_display_name = team_short_display_name,
    opponent_score = team_score,
    opponent_rolling_ppp = team_rolling_ppp,
    opponent_rolling_oppp = team_rolling_oppp
  ) |>
  mutate(opponent_id = as.numeric(opponent_team_id))

# with no keys specified, inner_join matches on every shared column name:
# game_id, opponent_team_id, game_date and season
games <- team_side |> inner_join(opponent_side)

games <- games |> mutate(
  team_result = as.factor(case_when(
    team_score > opponent_score ~ "W",
    opponent_score > team_score ~ "L"
)))

games$team_result <- relevel(games$team_result, ref="W")
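That relevel matters later: yardstick, the package behind metrics, treats the first factor level as the event of interest, so we want “W” in front. A quick check:

levels(games$team_result)
[1] "W" "L"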

modelgames <- games |> 
  select(
    game_id, 
    game_date, 
    team_short_display_name, 
    opponent_short_display_name, 
    season, 
    team_rolling_ppp, 
    team_rolling_oppp,
    opponent_rolling_ppp,
    opponent_rolling_oppp,
    team_result
    ) |>
  na.omit()

We’re going to go through the steps of modeling again, starting with splitting our modelgames data.

7.0.1 Exercise 1: setting up your data

game_split <- initial_split(modelgames, prop = .8)
game_train <- training(????_?????)
game_test <- testing(????_?????)
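One optional wrinkle that’s not part of the exercise: initial_split can stratify on the outcome, so wins and losses land in the training and testing sets in the same proportions. A sketch, if you want to try it:

game_split_strat <- initial_split(modelgames, prop = .8, strata = team_result)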

The recipe we’ll create is the same for all three models, so we’ll use it three times. Note the step_normalize at the end: putting predictors on a common scale barely matters to a tree-based model like lightGBM, but it matters a lot to an SVM, where position relative to the hyperplane depends on distance.

7.0.2 Exercise 2: setting up the recipe

So what data are we feeding into our recipe?

game_recipe <- 
  recipe(????_?????? ~ ., data = game_train) %>% 
  update_role(game_id, game_date, team_short_display_name, opponent_short_display_name, season, new_role = "ID") %>%
  step_normalize(all_predictors())

summary(game_recipe)
# A tibble: 10 × 4
   variable                    type      role      source  
   <chr>                       <list>    <chr>     <chr>   
 1 game_id                     <chr [2]> ID        original
 2 game_date                   <chr [1]> ID        original
 3 team_short_display_name     <chr [3]> ID        original
 4 opponent_short_display_name <chr [3]> ID        original
 5 season                      <chr [2]> ID        original
 6 team_rolling_ppp            <chr [2]> predictor original
 7 team_rolling_oppp           <chr [2]> predictor original
 8 opponent_rolling_ppp        <chr [2]> predictor original
 9 opponent_rolling_oppp       <chr [2]> predictor original
10 team_result                 <chr [3]> outcome   original

Now we’re going to create three different model specifications: the first is the logistic regression, the second is the lightGBM, and the third is the SVM.

log_mod <- 
  logistic_reg() %>% 
  set_engine("glm") %>%
  set_mode("classification")

lightgbm_mod <- 
  boost_tree() %>%
  set_engine("lightgbm") %>%
  set_mode(mode = "classification")

svm_mod <- 
  svm_poly() %>%
  set_engine("kernlab") %>%
  set_mode("classification") 

Now we have enough for our workflows. We have three models and one recipe.

7.0.3 Exercise 3: making workflows

log_workflow <- 
  workflow() %>% 
  add_model(???_mod) %>% 
  add_recipe(????_recipe)

lightgbm_workflow <- 
  workflow() %>% 
  add_model(light???_mod) %>% 
  add_recipe(game_recipe)

svm_workflow <- 
  workflow() %>% 
  add_model(svm_mod) %>% 
  add_recipe(????_??????)

Now we can fit our models to the data.

7.0.4 Exercise 4: fitting our models

log_fit <- 
  log_workflow %>% 
  fit(data = ????_?????)

lightgbm_fit <- 
  lightgbm_workflow %>% 
  fit(data = ????_?????)

svm_fit <- 
  svm_workflow %>% 
  fit(data = ????_?????)
 Setting default kernel parameters  

That Setting default kernel parameters line comes from kernlab, and it’s just telling you the polynomial kernel was fit with its defaults. It’s a message, not an error.

7.1 Prediction time

Now we can bind our predictions to the training data and see how we did.

# first bind the predicted class onto the training data ...
logpredict <- log_fit %>% predict(new_data = game_train) %>%
  bind_cols(game_train) 

# ... then bind the class probabilities onto that; the same two-step
# pattern repeats for the lightGBM and SVM below
logpredict <- log_fit %>% predict(new_data = game_train, type="prob") %>%
  bind_cols(logpredict)

lightgbmpredict <- lightgbm_fit %>% predict(new_data = game_train) %>%
  bind_cols(game_train) 

lightgbmpredict <- lightgbm_fit %>% predict(new_data = game_train, type="prob") %>%
  bind_cols(lightgbmpredict)

svmpredict <- svm_fit %>% predict(new_data = game_train) %>%
  bind_cols(game_train) 

svmpredict <- svm_fit %>% predict(new_data = game_train, type="prob") %>%
  bind_cols(svmpredict)
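Since we bound the class probabilities as well, nothing stops us from comparing models on a probability-based metric. A hedged aside: because “W” is our first level, yardstick builds a .pred_W column we can feed to roc_auc.

svmpredict %>% roc_auc(truth = team_result, .pred_W)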

Now, how did we do?

7.1.1 Exercise 5: The first metrics

What prediction dataset do we feed into our metrics? Let’s look first at the lightGBM.

metrics(?????????, team_result, .pred_class)
# A tibble: 2 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.655
2 kap      binary         0.310

And now the SVM.

7.1.2 Exercise 6: SVM metrics

metrics(???predict, team_result, .pred_class)
# A tibble: 2 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.639
2 kap      binary         0.278

Looks like the lightGBM did a little better than the SVM on the training data. But remember: a model makes its money on data it has never seen before.

First, we look at lightGBM.

lightgbmtestpredict <- lightgbm_fit %>% predict(new_data = game_test) %>%
  bind_cols(game_test)

lightgbmtestpredict <- lightgbm_fit %>% predict(new_data = game_test, type="prob") %>%
  bind_cols(lightgbmtestpredict)

metrics(lightgbmtestpredict, team_result, .pred_class)
# A tibble: 2 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.633
2 kap      binary         0.267

Right at 63 percent. And now SVM.

svmtestpredict <- svm_fit %>% predict(new_data = game_test) %>%
  bind_cols(game_test)

svmtestpredict <- svm_fit %>% predict(new_data = game_test, type="prob") %>%
  bind_cols(svmtestpredict)

metrics(svmtestpredict, team_result, .pred_class)
# A tibble: 2 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.636
2 kap      binary         0.272

Slightly better, but only very slightly. The more telling comparison is with training: the SVM slipped from .639 to .636 while the lightGBM slipped from .655 to .633, which suggests the SVM held up a bit better on data it had never seen.
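If you’d like the test-set comparison in one table, here’s a small sketch: label each model’s metrics and stack them with bind_rows.

bind_rows(
  metrics(lightgbmtestpredict, team_result, .pred_class) |> mutate(model = "lightGBM"),
  metrics(svmtestpredict, team_result, .pred_class) |> mutate(model = "SVM")
)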