3  Modeling and logistic regression

3.1 The basics

One of the most common – and seemingly least rigorous – parts of sports journalism is the prediction. There is no shortage of people making predictions about who will win a game or a league. Sure, they have a method – looking at how a team is playing, looking at the players, consulting their gut – but rarely do you hear of a sports pundit using a model.

We’re going to change that. Throughout this class, you’ll learn how to use modeling to make predictions. Some of these methods will predict numeric values (like how many points will a team score based on certain inputs). Some will predict categorical values (W or L, Yes or No, All Star or Not).

There are lots of problems in the world where the answer is not a number but a classification: Did they win or lose? Did the player get drafted or no? Is this player a flight risk to transfer or not?

These are problems of classification and there are algorithms we can use to estimate the probability that X will be the outcome. How likely is it that this team with these stats will win this game?

Where this gets interesting is in the middle.

library(tidyverse)
library(tidymodels)
library(hoopR)
library(zoo)
library(gt)
set.seed(1234)

What we need to do here is get both sides of the game. We’ll start with getting the box scores and then we’re going to remove all games involving non-Division I schools.

games <- load_mbb_team_box(seasons = 2015:2024)

# Teams with fewer than 10 games in a season get flagged as non-Division I
nond1 <- games |> group_by(team_id, season) |> tally() |> filter(n < 10) |> select(team_id)
nond1 <- pull(nond1)

# Keep only games where both teams are Division I
df <- games |> filter(!team_id %in% nond1 & !opponent_team_id %in% nond1)

3.2 Feature engineering

Feature engineering is the process of using what you know about something – domain knowledge – to find features in data that can be used in machine learning algorithms. Sports is a great place for this because not only do we know a lot from following the sport, but lots of other people are looking at this all the time. Creativity is good.

A number of basketball heads – including Ken Pomeroy of KenPom fame – have noticed that one of the predictors of the outcome of basketball games is possession metrics. How efficient are teams with the possessions they have? Can’t score if you don’t have the ball, so how good is a team at pushing the play and getting more possessions, giving themselves more chances to score?

One problem? Possessions aren’t a standard box score stat – they usually aren’t tracked directly. But you can estimate them from the numbers that are in a typical box score. The way to do that is like this:

Possessions = Field Goal Attempts - Offensive Rebounds + Turnovers + (0.475 * Free Throw Attempts)

If you look at the data we already have, however, you’ll see possessions are not actually in there. That’s unfortunate, but we can calculate them pretty easily.

We’ll use the possessions estimate formula to get that, and then use it to calculate offensive and defensive efficiency – points scored and points allowed per 100 possessions.
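To make that concrete, here’s a quick sketch of the arithmetic on a made-up stat line – the numbers below are invented for illustration, not pulled from our data.

# A made-up box score line, just to see the formula in action
field_goals_attempted <- 60
offensive_rebounds <- 10
turnovers <- 12
free_throws_attempted <- 20
points <- 75

# Estimated possessions: FGA - ORB + TO + (0.475 * FTA)
possessions <- field_goals_attempted - offensive_rebounds + turnovers + (.475 * free_throws_attempted)
possessions
# 71.5

# Offensive efficiency: points per 100 possessions
(points / possessions) * 100
# about 104.9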

Then, we’re going to create season cumulative averages of those efficiencies, but we’re going to lag them by one game. Why? Because we don’t know a team’s offensive efficiency in a game until that game is over – we only know what it was going into the game. So the lag gives us what each team’s metrics were before they played the game. That’s what we need for modeling. Otherwise, we’re cheating by letting the model see information from the very game it’s trying to predict.
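Here’s a minimal sketch of that lag-the-cumulative-mean idea, using made-up efficiencies for an imaginary team’s first four games:

# Made-up efficiencies to show what lag(cummean()) does
tibble(game = 1:4, efficiency = c(100, 110, 90, 120)) |> 
  mutate(
    avg_through_game = cummean(efficiency),            # includes the current game
    avg_before_game = lag(cummean(efficiency), n = 1)  # only what we knew going in
  )
# avg_before_game starts as NA because there's nothing to know before game one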

We’ll save that to a new dataframe called teamside.

3.2.1 Exercise 1: setting up your data

teamside <- df |> 
  group_by(team_short_display_name, season) |> 
  arrange(game_date) |> 
  mutate(
    team_??????????? = field_goals_attempted - offensive_rebounds + turnovers + (.475 * free_throws_attempted),
    team_points_per_?????????? = team_score/team_possessions,
    team_defensive_points_per_possession = opponent_team_score/team_possessions,
    team_offensive_efficiency = team_points_per_possession * 100,
    team_defensive_efficiency = team_defensive_points_per_possession * 100,
    team_season_offensive_efficiency = lag(cummean(team_offensive_efficiency), n=1),
    team_season_defensive_efficiency = lag(cummean(team_defensive_efficiency), n=1),  
    score_margin = team_score - opponent_team_score,
    absolute_score_margin = abs(score_margin)
  ) |> 
  filter(absolute_score_margin <= 40)

Now we’re going to repeat the process for the opponent, but we’re going to use a trick here. Because we have team box score data, every game is in here twice – every team in one box score is the opponent in another. We’re going to use that in our favor to create the opponent side of the equation and then join it back together afterwards. We start by dropping the opponent_team_id column, which we’ll get back by renaming team_id to opponent_team_id, and then we rename our team stats to opponent stats.

3.2.2 Exercise 2: Opponent side

opponentside <- teamside |> 
  select(-?????????????) |> 
  rename(
    opponent_team_id = team_id,
    opponent_season_offensive_efficiency = team_season_offensive_efficiency,
    opponent_season_defensive_efficiency = team_season_defensive_efficiency
  ) |> 
  select(
    game_id,
    opponent_team_id,
    opponent_season_offensive_efficiency,
    opponent_season_defensive_efficiency
  )

Now, let’s join them.

3.2.3 Exercise 3: Joining

bothsides <- ???????? |> inner_join(?????????)
Joining with `by = join_by(game_id, opponent_team_id)`

3.3 Modeling

Now we begin the process of creating a model. Modeling in data science has a ton of details, but the process for each model type is similar.

  1. Split your data into training and testing data sets. A common split is 80/20.
  2. Train the model on the training dataset.
  3. Evaluate the model on the training data.
  4. Apply the model to the testing data.
  5. Evaluate the model on the test data.

From there, it’s how you want to use the model. We’ll walk through a simple example here, using a simple model – a logistic regression model.

What we’re trying to do here is predict which team will win given their efficiency with the ball, expressed as the cumulative average efficiency. However, to make a prediction, we need to know their stats BEFORE the game – what we knew about the team going into the game in question.

3.3.1 Exercise 4: Who won?

The last problem to solve? Who won? We can add this with conditional logic. The other thing we’re doing here is converting our new team_result column into a factor. What is a factor? A factor is a data type in R that stores categorical values – values that can only take on a limited set of levels. Wins and losses are a perfect factor. Modeling libraries look for factors so they can treat the differences in the data as categories, which is why we’re converting it here.
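If factors are new to you, here’s a tiny throwaway example – a vector of made-up results, nothing to do with our data:

# A throwaway factor
results <- as.factor(c("W", "L", "W", "W", "L"))
class(results)
# "factor"
levels(results)
# "L" "W" -- levels default to alphabetical order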

The logic? If the team score is greater than the opponent team score, that’s a win. Otherwise, they take the L as the kids say these days.

bothsides <- bothsides |> mutate(
  team_result = as.factor(case_when(
    team_score > ??????????? ~ "W",
    opponent_team_score > ???????? ~ "L"
)))

Now that we’ve done that, we need to look at the order of the levels in our factor.

3.3.2 Exercise 5: Looking at the factors

To do that, we first need to know what R sees when it sees our team_result factor. Is a win first or is a loss first?

levels(bothsides$????_??????)
[1] "L" "W"

The order listed here matters: tidymodels treats the first level as the outcome we’re trying to predict, which means our predictions would be done through the lens of losses. That doesn’t make intuitive sense to us. We want to know who will win! We can reorder the levels with relevel.
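If you want to see relevel in isolation first, here’s a sketch with a made-up factor:

# "no" comes first because levels default to alphabetical order
answers <- factor(c("no", "yes", "no", "yes"))
levels(answers)
# "no" "yes"

# relevel() moves the level named in ref to the front
levels(relevel(answers, ref = "yes"))
# "yes" "no"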

3.3.3 Exercise 6: Releveling the factors

bothsides$team_result <- relevel(bothsides$team_result, ref="?")

levels(bothsides$team_result)
[1] "W" "L"

For simplicity, let’s limit the number of columns we’re going to feed our model.

modelgames <- bothsides |> 
  select(
    game_id, 
    game_date, 
    team_short_display_name, 
    opponent_team_short_display_name, 
    season, 
    team_season_offensive_efficiency,
    team_season_defensive_efficiency,
    opponent_season_offensive_efficiency,
    opponent_season_defensive_efficiency,
    team_result
    ) |> na.omit()

3.4 Visualizing the decision boundary

The chart below plots just two of our four predictors – the team’s season offensive efficiency against the opponent’s – but it can illustrate how this works. You can almost see a line running through the middle, with a lot of overlap. The further left or right you go, the less overlap. You can read it like this: if this team scores this efficiently and the opponent scores this efficiently, most of the time this team wins. Or loses. It just depends on where the dot ends up.

That neatly captures the probabilities we’re looking at here.

ggplot() + 
  geom_point(
    data=modelgames, aes(x=team_season_offensive_efficiency, y=opponent_season_offensive_efficiency, color=team_result))

3.5 The logistic regression

To create a model, we have to go through a process. That process starts with splitting data where we know the outcomes into two groups – training and testing. The training data is what we will use to create our model. The testing data is how we will determine how good it is. Then, going forward, our model can predict games we haven’t seen yet.

To do this, we’re going to first split our modelgames data into two groups – with 80 percent of it in one, 20 percent in the other. We do that by feeding our simplified dataframe into the initial_split function. Then we’ll pull those pieces out into new dataframes called log_train and log_test.
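If you want to see the mechanics in isolation first, here’s a quick sketch using mtcars, a built-in dataset that has nothing to do with basketball:

# An 80/20 split of a built-in dataset, just to see the moving parts
toy_split <- initial_split(mtcars, prop = .8)
toy_train <- training(toy_split)   # roughly 80 percent of mtcars' 32 rows
toy_test <- testing(toy_split)     # the leftover rows, held out for evaluation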

3.5.1 Exercise 7: What are we splitting?

log_split <- initial_split(?????????, prop = .8)
log_train <- training(log_split)
log_test <- testing(log_split)

Now we have two dataframes – log_train and log_test – that we can now use for modeling.

The first step in making a model is to declare what type of model this will be. We’re going to name our model object log_mod, which works because this is a logistic regression model. We’ll use the logistic_reg function in parsnip (the modeling library in Tidymodels) and set the engine to “glm”. The mode in our case is “classification” because we’re trying to classify something as a W or L. Later, we’ll use “regression” to predict numbers.

log_mod <- 
  logistic_reg() |> 
  set_engine("glm") |>
  set_mode("classification")

The next step is to create a recipe. This is a series of steps we’ll use to prepare our data for the model. For example: what is predicting what? Which columns are predictors and which aren’t? And do we have to do any pre-processing of the data?

The first part of the recipe is the formula. In this case, we’re saying – in real words – that team_result is modeled by our predictors, which we represent with a period (.), meaning every other column. Then, importantly, we say what isn’t a predictor with update_role. The team name, the game date and things like that are not predictors, so we need to tell the recipe that by giving them an “ID” role. The last step is normalizing our numbers. With logistic regression, differences in scale between columns can skew things, so step_normalize turns every predictor into a Z-score – subtracting the mean and dividing by the standard deviation.

log_recipe <- 
  recipe(team_result ~ ., data = log_train) |> 
  update_role(game_id, game_date, team_short_display_name, opponent_team_short_display_name, season, new_role = "ID") |>
  step_normalize(all_predictors())

summary(log_recipe)
# A tibble: 10 × 4
   variable                             type      role      source  
   <chr>                                <list>    <chr>     <chr>   
 1 game_id                              <chr [2]> ID        original
 2 game_date                            <chr [1]> ID        original
 3 team_short_display_name              <chr [3]> ID        original
 4 opponent_team_short_display_name     <chr [3]> ID        original
 5 season                               <chr [2]> ID        original
 6 team_season_offensive_efficiency     <chr [2]> predictor original
 7 team_season_defensive_efficiency     <chr [2]> predictor original
 8 opponent_season_offensive_efficiency <chr [2]> predictor original
 9 opponent_season_defensive_efficiency <chr [2]> predictor original
10 team_result                          <chr [3]> outcome   original

Now we have enough for a workflow. A workflow is what we use to put it all together. In it, we add our model definition and our recipe.

3.5.2 Exercise 8: Making a workflow

log_workflow <- 
  workflow() |> 
  add_model(log_???) |> 
  add_recipe(log_??????)

And now we fit our model (this can take a few minutes).

log_fit <- 
  log_workflow |> 
  fit(data = log_train)

3.6 Evaluating the fit

With logistic regression, there are two things we’re looking at: the predicted class and the probabilities behind it. We can get those with two different predict calls on the same fit and combine them together.

First, you can see the predictions like this:

trainpredict <- log_fit |> predict(new_data = log_train) |>
  bind_cols(log_train)

head(trainpredict)
# A tibble: 6 × 11
  .pred_class   game_id game_date  team_short_display_n…¹ opponent_team_short_…²
  <fct>           <int> <date>     <chr>                  <chr>                 
1 L           401096956 2018-11-23 Marquette              Louisville            
2 W           400843818 2016-01-17 S Illinois             Drake                 
3 W           400986857 2017-12-21 BYU                    Idaho St              
4 L           401491229 2023-01-11 Loyola MD              Bucknell              
5 L           401489063 2022-12-10 N Arizona              Utah Valley           
6 W           401484595 2023-02-04 Tennessee              Auburn                
# ℹ abbreviated names: ¹​team_short_display_name,
#   ²​opponent_team_short_display_name
# ℹ 6 more variables: season <int>, team_season_offensive_efficiency <dbl>,
#   team_season_defensive_efficiency <dbl>,
#   opponent_season_offensive_efficiency <dbl>,
#   opponent_season_defensive_efficiency <dbl>, team_result <fct>

Then we can add the probabilities to trainpredict using bind_cols, which binds the columns of this new prediction onto the existing trainpredict.

trainpredict <- log_fit |> predict(new_data = log_train, type="prob") |>
  bind_cols(trainpredict)

head(trainpredict)
# A tibble: 6 × 13
  .pred_W .pred_L .pred_class   game_id game_date  team_short_display_name
    <dbl>   <dbl> <fct>           <int> <date>     <chr>                  
1   0.338   0.662 L           401096956 2018-11-23 Marquette              
2   0.772   0.228 W           400843818 2016-01-17 S Illinois             
3   0.773   0.227 W           400986857 2017-12-21 BYU                    
4   0.271   0.729 L           401491229 2023-01-11 Loyola MD              
5   0.259   0.741 L           401489063 2022-12-10 N Arizona              
6   0.707   0.293 W           401484595 2023-02-04 Tennessee              
# ℹ 7 more variables: opponent_team_short_display_name <chr>, season <int>,
#   team_season_offensive_efficiency <dbl>,
#   team_season_defensive_efficiency <dbl>,
#   opponent_season_offensive_efficiency <dbl>,
#   opponent_season_defensive_efficiency <dbl>, team_result <fct>

There are several metrics we could use to evaluate the model on our training data, but the two we will use are accuracy and kap. Both point at how well the model did, in two different ways. The accuracy metric is the share of predictions that are correct when compared to the known results. The kap metric – Cohen’s kappa – adjusts accuracy for how often we’d be right just by chance, so it’s a tougher grader. The inputs here are the data, the column that has the actual result, and the column with the prediction, called .pred_class.
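As a sanity check, you could also compute accuracy by hand – it’s just the share of rows where the predicted class matches the actual result – and it should agree with what metrics reports:

# Accuracy by hand: how often does the predicted class match the actual result?
trainpredict |> 
  ungroup() |>   # in case a group_by() from earlier carried through
  summarize(accuracy_by_hand = mean(.pred_class == team_result))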

3.6.1 Exercise 9: Metrics

metrics(trainpredict, ????_??????, .pred_class)
# A tibble: 2 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.677
2 kap      binary         0.355

So how accurate is our model? If we’re looking for perfection, we’re far from it. But if we’re looking to make straight up win loss bets … we’re doing okay!

Another way to look at the results is the confusion matrix, which shows what was predicted compared to what actually happened. The four cells are true positives, false positives, false negatives and true negatives. The true cells divided by the total make up the accuracy.

3.6.2 Exercise 10: Confusion matrix

trainpredict |>
  conf_mat(????_result, .pred_?????)
          Truth
Prediction     W     L
         W 26403 12595
         L 12620 26538
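You can tie the matrix back to the accuracy figure from the last exercise: correct predictions sit on the diagonal, and the diagonal divided by the grand total is the accuracy.

# Diagonal (correct predictions) divided by all predictions
(26403 + 26538) / (26403 + 12595 + 12620 + 26538)
# roughly 0.677 -- the same accuracy metrics reported on the training data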

3.7 Comparing it to test data

Now we can apply our fit to the test data to see how robust it is. If the metrics are similar, that’s good – it means our model is robust. If the metrics change a lot – especially if they’re much worse on the test data – that’s bad. It means our model memorized the training data instead of learning a pattern that generalizes.

testpredict <- log_fit |> predict(new_data = log_test) |>
  bind_cols(log_test)

testpredict <- log_fit |> predict(new_data = log_test, type="prob") |>
  bind_cols(testpredict)

And now some metrics on the test data.

3.7.1 Exercise 11: Testing

metrics(????predict, team_result, .pred_class)
# A tibble: 2 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.679
2 kap      binary         0.358

How does that compare to our training data? Is it lower? Higher? Are the changes large – like are we talking about single digit changes or double digit changes? The less it changes, the better.

And now the confusion matrix.

testpredict |>
  conf_mat(team_result, .pred_class)
          Truth
Prediction    W    L
         W 6604 3052
         L 3221 6663

How does that compare to the training data?

3.8 How well did it do with Nebraska?

Let’s grab predictions for Nebraska from both our test and train data and take a look.

nutrain <- trainpredict |> filter(team_short_display_name == "Nebraska" &  season == 2024)

nutest <- testpredict |> filter(team_short_display_name == "Nebraska" & season == 2024)

bind_rows(nutrain, nutest) |> 
  arrange(game_date) |> 
  select(.pred_W, .pred_class, team_result, team_short_display_name, opponent_team_short_display_name) |> 
  gt()
.pred_W    .pred_class  team_result  team_short_display_name  opponent_team_short_display_name
0.9903245  W            W            Nebraska                 Rider
0.9655871  W            W            Nebraska                 Stony Brook
0.7899433  W            W            Nebraska                 Oregon St
0.6840794  W            W            Nebraska                 Duquesne
0.9601307  W            W            Nebraska                 Fullerton
0.4456791  L            L            Nebraska                 Creighton
0.5510326  W            L            Nebraska                 Minnesota
0.5024072  W            W            Nebraska                 Michigan St
0.4904807  L            W            Nebraska                 Kansas St
0.7897485  W            W            Nebraska                 North Dakota
0.8886867  W            W            Nebraska                 SC State
0.6859965  W            W            Nebraska                 Indiana
0.5322284  W            L            Nebraska                 Wisconsin
0.3198108  L            W            Nebraska                 Purdue
0.5361151  W            L            Nebraska                 Iowa
0.6545672  W            L            Nebraska                 Rutgers
0.5030016  W            W            Nebraska                 Northwestern
0.4291228  L            W            Nebraska                 Ohio State
0.5574689  W            L            Nebraska                 Maryland
0.4150621  L            W            Nebraska                 Wisconsin
0.3412217  L            L            Nebraska                 Illinois
0.4922467  L            L            Nebraska                 Northwestern
0.6401907  W            W            Nebraska                 Michigan
0.5893802  W            W            Nebraska                 Penn State
0.6827122  W            W            Nebraska                 Indiana
0.4931792  L            W            Nebraska                 Minnesota
0.5560308  W            L            Nebraska                 Ohio State

By our cumulative metrics, are there any surprises? Should we have beaten Creighton or Purdue?

How could you improve this?