6  LightGBM

6.1 The basics

LightGBM is another tree-based method that is similar to XGBoost but differs in ways that make it computationally more efficient. Where XGBoost and Random Forests build out whole levels of branches at a time, LightGBM grows leaf-wise. Think of it like this: XGBoost comes to a fork, makes the split that best reduces the error the last tree left behind, and then does the same for every branch at that level before moving down. LightGBM, on the other hand, keeps splitting whichever single leaf reduces the error the most, which can mean lots of little splits instead of big ones. That can lead to overfitting on small datasets, but it also means it’s much faster than XGBoost.

LightGBM also buckets the data into histograms and chooses splits from those bins, where XGBoost is computationally grinding through every candidate split to optimize those choices. Roughly translated: LightGBM is looking at the fat part of a normal distribution to make choices, where XGBoost is searching exhaustively for the optimal path forward. It’s another reason why LightGBM is faster, but also less reliable with small datasets. We’ll see how to take direct control of both behaviors when we define our model specifications below.

Let’s implement a LightGBM model. We start with libraries.

library(tidyverse)
library(tidymodels)
library(hoopR)
library(zoo)
library(bonsai)

set.seed(1234)
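One note on the libraries: bonsai is the package that supplies the LightGBM engine to tidymodels, and it needs the lightgbm package installed underneath it. If either library call fails, you likely just need to install them first:

install.packages("bonsai")
install.packages("lightgbm")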

We’ll continue to use what we’ve done for feature engineering – a rolling window of points per possession for team and opponent. You should be quite familiar with this by now.

teamgames <- load_mbb_team_box(seasons = 2015:2024)

teamstats <- teamgames |> 
  mutate(
    # a standard possessions estimate: field goal attempts and turnovers end
    # possessions, offensive rebounds extend them, and .475 discounts free
    # throw attempts down to free throw trips
    possessions = field_goals_attempted - offensive_rebounds + turnovers + (.475 * free_throws_attempted),
    ppp = team_score/possessions,           # points per possession
    oppp = opponent_team_score/possessions  # opponent points per possession
  )

rollingteamstats <- teamstats |> 
  group_by(team_short_display_name, season) |>
  arrange(game_date) |>
  mutate(
    # lag by one game so a game's own result isn't part of its rolling
    # average, then average the five most recent games
    team_rolling_ppp = rollmean(lag(ppp, n=1), k=5, align="right", fill=NA),
    team_rolling_oppp = rollmean(lag(oppp, n=1), k=5, align="right", fill=NA)
    ) |> 
  ungroup()

team_side <- rollingteamstats |>
  select(
    game_id,
    team_id, 
    team_short_display_name, 
    opponent_team_id, 
    game_date, 
    season, 
    team_score, 
    team_rolling_ppp,
    team_rolling_oppp
    )

opponent_side <- team_side |>
  select(-opponent_team_id) |> 
  rename(
    opponent_team_id = team_id,
    opponent_short_display_name = team_short_display_name,
    opponent_score = team_score,
    opponent_rolling_ppp = team_rolling_ppp,
    opponent_rolling_oppp = team_rolling_oppp
  ) |>
  mutate(opponent_id = as.numeric(opponent_team_id))

# with no "by" argument, inner_join matches on every shared column:
# game_id, opponent_team_id, game_date and season
games <- team_side |> inner_join(opponent_side)

games <- games |> mutate(
  team_result = as.factor(case_when(
    team_score > opponent_score ~ "W",
    opponent_score > team_score ~ "L"
  ))
)

games$team_result <- relevel(games$team_result, ref="W")
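A quick sanity check here, because it matters later: relevel makes W the first level of the factor, and yardstick treats the first level as the event we're predicting.

levels(games$team_result)
[1] "W" "L"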

modelgames <- games |> 
  select(
    game_id, 
    game_date, 
    team_short_display_name, 
    opponent_short_display_name, 
    season, 
    team_rolling_ppp, 
    team_rolling_oppp,
    opponent_rolling_ppp,
    opponent_rolling_oppp,
    team_result
    ) |>
  na.omit()

For this tutorial, we’re going to create three models from three workflows so that we can compare a logistic regression to a random forest to a LightGBM model.

6.2 Setup

We’re going to go through the steps of modeling again, starting with splitting our modelgames data.

6.2.1 Exercise 1: setting up your data

game_split <- initial_split(??????????, prop = .8)
game_train <- training(game_split)
game_test <- testing(game_split)

The recipe we’ll create is the same for all three models, so we’ll use it three times.

6.2.2 Exercise 2: setting up the recipe

So what data are we feeding into our recipe?

game_recipe <- 
  recipe(team_result ~ ., data = game_?????) |> 
  update_role(game_id, game_date, team_short_display_name, opponent_short_display_name, season, new_role = "ID") |>
  step_normalize(all_predictors())

summary(game_recipe)
# A tibble: 10 × 4
   variable                    type      role      source  
   <chr>                       <list>    <chr>     <chr>   
 1 game_id                     <chr [2]> ID        original
 2 game_date                   <chr [1]> ID        original
 3 team_short_display_name     <chr [3]> ID        original
 4 opponent_short_display_name <chr [3]> ID        original
 5 season                      <chr [2]> ID        original
 6 team_rolling_ppp            <chr [2]> predictor original
 7 team_rolling_oppp           <chr [2]> predictor original
 8 opponent_rolling_ppp        <chr [2]> predictor original
 9 opponent_rolling_oppp       <chr [2]> predictor original
10 team_result                 <chr [3]> outcome   original

Now, we’re going to create three different model specifications. The first will be the logistic regression model definition, the second will be the random forest, and the third will be the LightGBM.

log_mod <- 
  logistic_reg() |> 
  set_engine("glm") |>
  set_mode("classification")

rf_mod <- 
  rand_forest() |> 
  set_engine("ranger") |>
  set_mode("classification")

lightgbm_mod <- 
  boost_tree() |>
  set_engine("lightgbm") |>
  set_mode("classification")
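Remember the leaf-wise growth and histogram bucketing from the top of the chapter? If you ever want to control them directly, you can hand LightGBM’s own parameters to set_engine and bonsai will pass them through. This is just a sketch, not a recommendation: num_leaves and max_bin are LightGBM parameters rather than tidymodels arguments, and all the values here are purely illustrative.

lightgbm_careful_mod <- 
  boost_tree(min_n = 50) |>   # demand more games per leaf, a guard against overfitting
  set_engine(
    "lightgbm",
    num_leaves = 15,  # cap how far leaf-wise growth can sprawl (LightGBM's default is 31)
    max_bin = 63      # coarser histograms: faster, but less precise splits
  ) |>
  set_mode("classification")

A smaller num_leaves and a larger min_n blunt the overfitting risk that comes with leaf-wise growth; max_bin controls how finely the histograms slice the data. Check the LightGBM documentation before leaning on any of these.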

Now we have enough for our workflows. We have three models and one recipe.

6.2.3 Exercise 3: making workflows

log_workflow <- 
  workflow() |> 
  add_model(???_mod) |> 
  add_recipe(????_recipe)

rf_workflow <- 
  workflow() |> 
  add_model(??_mod) |> 
  add_recipe(????_recipe)

lightgbm_workflow <- 
  workflow() |> 
  add_model(light???_mod) |> 
  add_recipe(game_recipe)

Now we can fit our models to the data.

6.2.4 Exercise 4: fitting our models

log_fit <- 
  log_workflow |> 
  fit(data = ????_?????)

rf_fit <- 
  rf_workflow |> 
  fit(data = ????_?????)

lightgbm_fit <- 
  lightgbm_workflow |> 
  fit(data = ????_?????)

6.3 Prediction time

Now we can bind our predictions to the training data and see how we did.

logpredict <- log_fit |> predict(new_data = game_train) |>
  bind_cols(game_train) 

logpredict <- log_fit |> predict(new_data = game_train, type="prob") |>
  bind_cols(logpredict)

rfpredict <- rf_fit |> predict(new_data = game_train) |>
  bind_cols(game_train) 

rfpredict <- rf_fit |> predict(new_data = game_train, type="prob") |>
  bind_cols(rfpredict)

lightgbmpredict <- lightgbm_fit |> predict(new_data = game_train) |>
  bind_cols(game_train) 

lightgbmpredict <- lightgbm_fit |> predict(new_data = game_train, type="prob") |>
  bind_cols(lightgbmpredict)

Now, how did we do?

6.3.1 Exercise 5: The first metrics

What prediction dataset do we feed into our metrics? Let’s look first at the random forest, because it’s a tree-based method just like lightgbm.

metrics(?????????, team_result, .pred_class)
# A tibble: 2 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.993
2 kap      binary         0.985

Same as last time, the random forest produces bonkers training numbers. Can you say overfit?

How about the lightgbm?

6.3.2 Exercise 6: LightGBM metrics

metrics(????????predict, team_result, .pred_class)
# A tibble: 2 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.655
2 kap      binary         0.310

About 66 percent accuracy, which, if you’ll recall, is a few percentage points better than logistic regression, and worse than the random forest WITH A HUGE ASTERISK.

Remember: Where a model makes its money is in data that it has never seen before.

First, we look at the random forest, and the inevitable crash that comes with random forests.

rftestpredict <- rf_fit |> predict(new_data = game_test) |>
  bind_cols(game_test)

rftestpredict <- rf_fit |> predict(new_data = game_test, type="prob") |>
  bind_cols(rftestpredict)

metrics(rftestpredict, team_result, .pred_class)
# A tibble: 2 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.618
2 kap      binary         0.237

Right at 62 percent, a little bit lower than logistic regression. But did the models come to the same answers to get those numbers? No. We’ll check that for our two tree-based methods in a moment.

And now LightGBM.

lightgbmtestpredict <- lightgbm_fit |> predict(new_data = game_test) |>
  bind_cols(game_test)

lightgbmtestpredict <- lightgbm_fit |> predict(new_data = game_test, type="prob") |>
  bind_cols(lightgbmtestpredict)

metrics(lightgbmtestpredict, team_result, .pred_class)
# A tibble: 2 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.633
2 kap      binary         0.267
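Back to that question of whether different models come to the same answers: the random forest and LightGBM land within a couple of points of each other, but do they agree game by game? Here’s a quick sketch to check, using the two test prediction frames we just built. The rows line up because both were bound onto the same game_test data.

agreement <- tibble(
  rf = rftestpredict$.pred_class,
  lightgbm = lightgbmtestpredict$.pred_class
)

# the share of test games where both models pick the same result
mean(agreement$rf == agreement$lightgbm)

If that number is well below 1, the two models are reaching similar accuracy by different routes.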

Our three models, based on our very basic feature engineering, are still only slightly better than flipping a coin. If we want to get better, we’ve still got work to do.
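One more hedged suggestion before we move on: we made probability predictions with type="prob" but never used them. yardstick’s roc_auc will score those probabilities directly, which tells you more than accuracy alone about how well a model separates wins from losses. Because we releveled team_result so that W is the first level, .pred_W is the column to feed it:

rftestpredict |> roc_auc(team_result, .pred_W)
lightgbmtestpredict |> roc_auc(team_result, .pred_W)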