Another year, another attempt, another bracket disaster

code
analysis
Author

Matt Waite

Published

March 28, 2022

Once again, I attempted to predict the outcome of the NCAA tournament using machine learning, and I had a class-load of students try the same.

If you like the madness part of March Madness, this year is for you.

It is not for machine learning algorithms based on regular season performance. At least not mine.

Of the 14 brackets I and my students produced, using 14 different methods, we came up with 7 unique national title winners.

Zero of them are right.

The best brackets had two of the four Final Four teams, but none picked a team still playing now to win it all with the Final Four set.

I had hope for my bracket picking algorithm this year. It had what felt like a good mix of upsets and favorites, some madness and some method.

What follows is a post mortem and an attempt to figure out what went wrong. If anything went wrong. How do you predict St. Peter’s? You don’t. It’s what makes this fun.

Code
library(tidyverse)
library(tidymodels)
library(zoo)
library(hoopR)
library(gt)

set.seed(1234)

kenpom <- read_csv("http://mattwaite.github.io/sportsdatafiles/ratings.csv")
namekey <- read_csv("http://mattwaite.github.io/sportsdatafiles/nametable.csv")
simplestats <- read_csv("http://mattwaite.github.io/sportsdatafiles/simplestats.csv")

teamgames <- load_mbb_team_box(seasons = 2015:2022) %>%
  filter(game_date < as.Date("2022-03-17")) %>%
  separate(field_goals_made_field_goals_attempted, into = c("field_goals_made","field_goals_attempted")) %>%
  separate(three_point_field_goals_made_three_point_field_goals_attempted, into = c("three_point_field_goals_made","three_point_field_goals_attempted")) %>%
  separate(free_throws_made_free_throws_attempted, into = c("free_throws_made","free_throws_attempted")) %>%
  mutate_at(12:34, as.numeric) %>% 
  mutate(team_id = as.numeric(team_id))

teamgames <- teamgames %>% left_join(namekey) %>% left_join(kenpom, by=c("team" = "team", "season"="year")) %>% left_join(simplestats, by=c("School" = "School", "season" = "Season"))

teamstats <- teamgames %>% 
  group_by(team_short_display_name, season) %>%
  arrange(game_date) %>%
  mutate(
    team_score = ((field_goals_made-three_point_field_goals_made) * 2) + (three_point_field_goals_made*3) + free_throws_made,
    possessions = field_goals_attempted - offensive_rebounds + turnovers + (.475 * free_throws_attempted),
    team_offensive_efficiency = (team_score/possessions)*100,
    true_shooting_percentage = (team_score / (2*(field_goals_attempted + (.44 * free_throws_attempted)))) * 100,
    turnover_pct = turnovers/(field_goals_attempted + 0.44 * free_throws_attempted + turnovers),
    free_throw_factor = free_throws_made/field_goals_attempted,
    team_rolling_true_shooting_percentage = rollmean(lag(true_shooting_percentage, n=1), k=10, align="right", fill=NA),
    team_rolling_turnover_percentage = rollmean(lag(turnover_pct, n=1), k=10, align="right", fill=NA),
    team_rolling_free_throw_factor = rollmean(lag(free_throw_factor, n=1), k=10, align="right", fill=NA), 
    team_cumulative_mean_true_shooting = lag(cummean(true_shooting_percentage), n=1, default=0),
    team_cumulative_mean_turnover_percentage = lag(cummean(turnover_pct), n=1, default=0),
    team_cumulative_mean_free_throw_factor = lag(cummean(free_throw_factor), n=1, default=0),
    team_cumulative_o_eff = lag(cummean(team_offensive_efficiency), n=1, default=0),
    team_rolling_o_eff = rollmean(lag(team_offensive_efficiency, n=1), k=10, align="right", fill=NA)
  ) %>% ungroup() %>%
  rename(
    team_sos = OverallSOS,
    team_srs = OverallSRS,
    team_luck = luck
  )

teamstats <- teamstats %>% 
  select(game_id, team_id, team_offensive_efficiency) %>%
  mutate(team_id = as.numeric(team_id)) %>% 
  rename(opponent_id = team_id, opponent_offensive_efficiency=team_offensive_efficiency) %>% 
  left_join(teamstats) %>%
  group_by(team_short_display_name, season) %>%
  arrange(game_date) %>%
  mutate(
    team_cumulative_d_eff = lag(cummean(opponent_offensive_efficiency), n=1, default=0),
    team_rolling_d_eff = rollmean(lag(opponent_offensive_efficiency, n=1), k=10, align="right", fill=NA)
    ) %>% ungroup()

opponent <- teamstats %>% select(game_id, team_id, offensive_rebounds, defensive_rebounds) %>% rename(opponent_id=team_id, opponent_offensive_rebounds = offensive_rebounds, opponent_defensive_rebounds=defensive_rebounds) %>% mutate(opponent_id = as.numeric(opponent_id))

newteamstats <- teamstats %>% 
  inner_join(opponent) %>% 
  mutate(
    orb = offensive_rebounds / (offensive_rebounds + opponent_defensive_rebounds),
    drb = defensive_rebounds / (opponent_offensive_rebounds + defensive_rebounds),
    team_rolling_orb = rollmean(lag(orb, n=1), k=10, align="right", fill=NA),
    team_rolling_drb = rollmean(lag(drb, n=1), k=10, align="right", fill=NA),
    team_cumulative_mean_orb = lag(cummean(orb), n=1, default=0),
    team_cumulative_mean_drb = lag(cummean(drb), n=1, default=0),
    team_efficiency_margin = team_cumulative_o_eff - team_cumulative_d_eff,
    team_recent_efficiency_margin = team_rolling_o_eff - team_rolling_d_eff,
    team_recency = team_recent_efficiency_margin - team_efficiency_margin
    ) 

team_side <- newteamstats %>%
  select(game_id, team_id, team_short_display_name, opponent_id, game_date, season, team_score, team_rolling_true_shooting_percentage, team_rolling_free_throw_factor, team_rolling_turnover_percentage, team_rolling_orb, team_rolling_drb, team_cumulative_mean_true_shooting, team_cumulative_mean_turnover_percentage, team_cumulative_mean_free_throw_factor, team_cumulative_mean_orb, team_cumulative_mean_drb, team_cumulative_o_eff, team_cumulative_d_eff, team_efficiency_margin, team_sos, team_srs, team_luck, team_recency) %>% na.omit()

opponent_side <- newteamstats %>%
  select(game_id, team_id, team_short_display_name, team_score, team_rolling_true_shooting_percentage, team_rolling_free_throw_factor, team_rolling_turnover_percentage, team_rolling_orb, team_rolling_drb, team_cumulative_mean_true_shooting, team_cumulative_mean_turnover_percentage, team_cumulative_mean_free_throw_factor, team_cumulative_mean_orb, team_cumulative_mean_drb, team_cumulative_o_eff, team_cumulative_d_eff, team_efficiency_margin, team_sos, team_srs, team_luck, team_recency) %>% na.omit() %>%
  rename(
    opponent_id = team_id,
    opponent_short_display_name = team_short_display_name,
    opponent_score = team_score,
    opponent_rolling_true_shooting_percentage = team_rolling_true_shooting_percentage,
    opponent_rolling_free_throw_factor = team_rolling_free_throw_factor,
    opponent_rolling_turnover_percentage = team_rolling_turnover_percentage,
    opponent_rolling_orb = team_rolling_orb,
    opponent_rolling_drb = team_rolling_drb,
    opponent_cumulative_mean_true_shooting = team_cumulative_mean_true_shooting,
    opponent_cumulative_mean_turnover_percentage = team_cumulative_mean_turnover_percentage,
    opponent_cumulative_mean_free_throw_factor = team_cumulative_mean_free_throw_factor,
    opponent_cumulative_mean_orb = team_cumulative_mean_orb,
    opponent_cumulative_mean_drb = team_cumulative_mean_drb,
    opponent_cumulative_o_eff = team_cumulative_o_eff,
    opponent_cumulative_d_eff = team_cumulative_d_eff,
    opponent_efficiency_margin = team_efficiency_margin,
    opponent_srs = team_srs,
    opponent_sos = team_sos,
    opponent_luck = team_luck,
    opponent_recency = team_recency
  ) %>%
  mutate(
    opponent_id = as.numeric(opponent_id)
    )

games <- team_side %>% inner_join(opponent_side) %>% mutate(
  TeamResult = as.factor(case_when(
    team_score > opponent_score ~ "W",
    opponent_score > team_score ~ "L"
))) %>% na.omit()

games$TeamResult <- relevel(games$TeamResult, ref="W")

cumulativesimplemodelgames <- games %>% select(game_id, game_date, team_short_display_name, opponent_short_display_name, season, opponent_efficiency_margin, team_efficiency_margin, team_sos, team_srs, opponent_sos, opponent_srs, opponent_luck, team_luck, opponent_recency, team_recency, TeamResult) %>% na.omit()

cumulative_split <- initial_split(cumulativesimplemodelgames, prop = .8)
cumulative_train <- training(cumulative_split)
cumulative_test <- testing(cumulative_split)

cumulative_recipe <- 
  recipe(TeamResult ~ ., data = cumulative_train) %>% 
  update_role(game_id, game_date, team_short_display_name, opponent_short_display_name, season, new_role = "ID") %>%
  step_normalize(all_predictors())

log_mod <- 
  logistic_reg() %>% 
  set_engine("glm") %>%
  set_mode("classification")

log_workflow <- 
  workflow() %>% 
  add_model(log_mod) %>% 
  add_recipe(cumulative_recipe)

log_fit <- 
  log_workflow %>% 
  fit(data = cumulative_train)

teamstats <- teamgames %>% 
  group_by(team_short_display_name, season) %>%
  arrange(game_date) %>%
  mutate(
    team_score = ((field_goals_made-three_point_field_goals_made) * 2) + (three_point_field_goals_made*3) + free_throws_made,
    possessions = field_goals_attempted - offensive_rebounds + turnovers + (.475 * free_throws_attempted),
    team_offensive_efficiency = (team_score/possessions)*100,
    true_shooting_percentage = (team_score / (2*(field_goals_attempted + (.44 * free_throws_attempted)))) * 100,
    turnover_pct = turnovers/(field_goals_attempted + 0.44 * free_throws_attempted + turnovers),
    free_throw_factor = free_throws_made/field_goals_attempted,
    team_rolling_true_shooting_percentage = rollmean(true_shooting_percentage, k=5, align="right", fill=NA),
    team_rolling_turnover_percentage = rollmean(turnover_pct, k=5, align="right", fill=NA),
    team_rolling_free_throw_factor = rollmean(free_throw_factor, k=4, align="right", fill=NA), 
    team_cumulative_mean_true_shooting = cummean(true_shooting_percentage),
    team_cumulative_mean_turnover_percentage = cummean(turnover_pct),
    team_cumulative_mean_free_throw_factor = cummean(free_throw_factor),
    team_cumulative_o_eff = cummean(team_offensive_efficiency),
    team_rolling_o_eff = rollmean(team_offensive_efficiency, k=10, align="right", fill=NA)
  ) %>% ungroup() 

teamstats <- teamstats %>% 
  select(game_id, team_id, team_offensive_efficiency) %>%
  mutate(team_id = as.numeric(team_id)) %>% 
  rename(opponent_id = team_id, opponent_offensive_efficiency=team_offensive_efficiency) %>% 
  left_join(teamstats) %>%
  group_by(team_short_display_name, season) %>%
  arrange(game_date) %>%
  mutate(
    team_cumulative_d_eff = cummean(opponent_offensive_efficiency),
    team_rolling_d_eff = rollmean(opponent_offensive_efficiency, k=10, align="right", fill=NA)
    ) %>% ungroup() %>%
  rename(
    team_sos = OverallSOS,
    team_srs = OverallSRS, 
    team_luck = luck
  )

opponent <- teamstats %>% select(game_id, team_id, offensive_rebounds, defensive_rebounds) %>% rename(opponent_id=team_id, opponent_offensive_rebounds = offensive_rebounds, opponent_defensive_rebounds=defensive_rebounds) %>% mutate(opponent_id = as.numeric(opponent_id))

newteamstats <- teamstats %>% 
  inner_join(opponent) %>% 
  mutate(
    orb = offensive_rebounds / (offensive_rebounds + opponent_defensive_rebounds),
    drb = defensive_rebounds / (opponent_offensive_rebounds + defensive_rebounds),
    team_rolling_orb = rollmean(orb, k=5, align="right", fill=NA),
    team_rolling_drb = rollmean(drb, k=5, align="right", fill=NA),
    team_cumulative_mean_orb = cummean(orb),
    team_cumulative_mean_drb = cummean(drb),
    team_efficiency_margin = team_cumulative_o_eff - team_cumulative_d_eff,
    team_recent_efficiency_margin = team_rolling_o_eff - team_rolling_d_eff,
    team_recency = team_recent_efficiency_margin - team_efficiency_margin
    )

team_side <- newteamstats %>%
  select(game_id, team_id, team_short_display_name, opponent_id, game_date, season, team_score, team_rolling_true_shooting_percentage, team_rolling_free_throw_factor, team_rolling_turnover_percentage, team_rolling_orb, team_rolling_drb, team_cumulative_mean_true_shooting, team_cumulative_mean_turnover_percentage, team_cumulative_mean_free_throw_factor, team_cumulative_mean_orb, team_cumulative_mean_drb, team_cumulative_o_eff, team_cumulative_d_eff, team_efficiency_margin, team_rolling_o_eff, team_rolling_d_eff, team_sos, team_srs, team_luck, team_recency) %>% na.omit()

opponent_side <- newteamstats %>%
  select(game_id, team_id, team_short_display_name, team_score, team_rolling_true_shooting_percentage, team_rolling_free_throw_factor, team_rolling_turnover_percentage, team_rolling_orb, team_rolling_drb, team_cumulative_mean_true_shooting, team_cumulative_mean_turnover_percentage, team_cumulative_mean_free_throw_factor, team_cumulative_mean_orb, team_cumulative_mean_drb, team_cumulative_o_eff, team_cumulative_d_eff, team_efficiency_margin, team_rolling_o_eff, team_rolling_d_eff, team_sos, team_srs, team_luck, team_recency) %>% na.omit() %>%
  rename(
    opponent_id = team_id,
    opponent_short_display_name = team_short_display_name,
    opponent_score = team_score,
    opponent_rolling_true_shooting_percentage = team_rolling_true_shooting_percentage,
    opponent_rolling_free_throw_factor = team_rolling_free_throw_factor,
    opponent_rolling_turnover_percentage = team_rolling_turnover_percentage,
    opponent_rolling_orb = team_rolling_orb,
    opponent_rolling_drb = team_rolling_drb,
    opponent_cumulative_mean_true_shooting = team_cumulative_mean_true_shooting,
    opponent_cumulative_mean_turnover_percentage = team_cumulative_mean_turnover_percentage,
    opponent_cumulative_mean_free_throw_factor = team_cumulative_mean_free_throw_factor,
    opponent_cumulative_mean_orb = team_cumulative_mean_orb,
    opponent_cumulative_mean_drb = team_cumulative_mean_drb,
    opponent_cumulative_o_eff = team_cumulative_o_eff,
    opponent_cumulative_d_eff = team_cumulative_d_eff,
    opponent_efficiency_margin = team_efficiency_margin,
    opponent_rolling_o_eff = team_rolling_o_eff, 
    opponent_rolling_d_eff = team_rolling_d_eff,
    opponent_srs = team_srs,
    opponent_sos = team_sos,
    opponent_luck = team_luck,
    opponent_recency = team_recency
  ) %>%
  mutate(
    opponent_id = as.numeric(opponent_id)
    )

games <- team_side %>% inner_join(opponent_side) %>% mutate(
  TeamResult = as.factor(case_when(
    team_score > opponent_score ~ "W",
    opponent_score > team_score ~ "L"
))) %>% na.omit()

games$TeamResult <- relevel(games$TeamResult, ref="W")

cumulativesimplemodelgames <- games %>% select(game_id, game_date, team_short_display_name, opponent_short_display_name, season, opponent_efficiency_margin, team_efficiency_margin, team_sos, team_srs, opponent_sos, opponent_srs, team_luck, opponent_luck, team_recency, opponent_recency, TeamResult) 

westround1games <- tibble(
  team_short_display_name="Gonzaga",
  opponent_short_display_name="Georgia State"
) %>% add_row(
  team_short_display_name="Boise State",
  opponent_short_display_name="Memphis"
) %>% add_row(
  team_short_display_name="UConn",
  opponent_short_display_name="New Mexico St"
) %>% add_row(
  team_short_display_name="Arkansas",
  opponent_short_display_name="Vermont"
) %>% add_row(
  team_short_display_name="Alabama",
  opponent_short_display_name="Notre Dame"
) %>% add_row(
  team_short_display_name="Texas Tech",
  opponent_short_display_name="Montana State"
) %>% add_row(
  team_short_display_name="Michigan State",
  opponent_short_display_name="Davidson"
) %>% add_row(
  team_short_display_name="Duke",
  opponent_short_display_name="CSU Fullerton"
)

westround1games <- cumulativesimplemodelgames %>% group_by(team_short_display_name) %>% filter(game_date == max(game_date) & season == 2022) %>% slice(1) %>% select(-TeamResult, -starts_with("opponent")) %>% right_join(westround1games)

westround1games <- cumulativesimplemodelgames %>% group_by(opponent_short_display_name) %>% filter(game_date == max(game_date) & season == 2022) %>% slice(1) %>% ungroup() %>% select(-TeamResult, -starts_with("team"), -game_id, -game_date, -season) %>% right_join(westround1games) 

westround1log <- log_fit %>% predict(new_data = westround1games) %>%
  bind_cols(westround1games) %>% select(.pred_class, team_short_display_name, opponent_short_display_name, everything())

westround1log <- log_fit %>% predict(new_data = westround1log, type="prob") %>%
  bind_cols(westround1log) %>% select(.pred_class, .pred_W, .pred_L, team_short_display_name, opponent_short_display_name, everything())

westround2games <- tibble(
  team_short_display_name="Gonzaga",
  opponent_short_display_name="Memphis"
) %>% add_row(
  team_short_display_name="UConn",
  opponent_short_display_name="Arkansas"
) %>% add_row(
  team_short_display_name="Alabama",
  opponent_short_display_name="Texas Tech"
) %>% add_row(
  team_short_display_name="Michigan State",
  opponent_short_display_name="Duke"
)

westround2games <- cumulativesimplemodelgames %>% group_by(team_short_display_name) %>% filter(game_date == max(game_date) & season == 2022) %>% slice(1) %>% select(-TeamResult, -starts_with("opponent")) %>% right_join(westround2games)

westround2games <- cumulativesimplemodelgames %>% group_by(opponent_short_display_name) %>% filter(game_date == max(game_date) & season == 2022) %>% slice(1) %>% ungroup() %>% select(-TeamResult, -starts_with("team"), -game_id, -game_date, -season) %>% right_join(westround2games) 

westround2log <- log_fit %>% predict(new_data = westround2games) %>%
  bind_cols(westround2games) %>% select(.pred_class, team_short_display_name, opponent_short_display_name, everything())

westround2log <- log_fit %>% predict(new_data = westround2log, type="prob") %>%
  bind_cols(westround2log) %>% select(.pred_class, .pred_W, .pred_L, team_short_display_name, opponent_short_display_name, everything())

westround3games <- tibble(
  team_short_display_name="Gonzaga",
  opponent_short_display_name="Arkansas"
) %>% add_row(
  team_short_display_name="Alabama",
  opponent_short_display_name="Duke"
) 

westround3games <- cumulativesimplemodelgames %>% group_by(team_short_display_name) %>% filter(game_date == max(game_date) & season == 2022) %>% slice(1) %>% select(-TeamResult, -starts_with("opponent")) %>% right_join(westround3games)

westround3games <- cumulativesimplemodelgames %>% group_by(opponent_short_display_name) %>% filter(game_date == max(game_date) & season == 2022) %>% slice(1) %>% ungroup() %>% select(-TeamResult, -starts_with("team"), -game_id, -game_date, -season) %>% right_join(westround3games) 

westround3log <- log_fit %>% predict(new_data = westround3games) %>%
  bind_cols(westround3games) %>% select(.pred_class, team_short_display_name, opponent_short_display_name, everything())

westround3log <- log_fit %>% predict(new_data = westround3log, type="prob") %>%
  bind_cols(westround3log) %>% select(.pred_class, .pred_W, .pred_L, team_short_display_name, opponent_short_display_name, everything())

westround4games <- tibble(
  team_short_display_name="Gonzaga",
  opponent_short_display_name="Alabama"
) 

westround4games <- cumulativesimplemodelgames %>% group_by(team_short_display_name) %>% filter(game_date == max(game_date) & season == 2022) %>% slice(1) %>% select(-TeamResult, -starts_with("opponent")) %>% right_join(westround4games)

westround4games <- cumulativesimplemodelgames %>% group_by(opponent_short_display_name) %>% filter(game_date == max(game_date) & season == 2022) %>% slice(1) %>% ungroup() %>% select(-TeamResult, -starts_with("team"), -game_id, -game_date, -season) %>% right_join(westround4games) 

westround4log <- log_fit %>% predict(new_data = westround4games) %>%
  bind_cols(westround4games) %>% select(.pred_class, team_short_display_name, opponent_short_display_name, everything())

westround4log <- log_fit %>% predict(new_data = westround4log, type="prob") %>%
  bind_cols(westround4log) %>% select(.pred_class, .pred_W, .pred_L, team_short_display_name, opponent_short_display_name, everything())

eastround1games <- tibble(
  team_short_display_name="Baylor",
  opponent_short_display_name="Norfolk State"
) %>% add_row(
  team_short_display_name="North Carolina",
  opponent_short_display_name="Marquette"
) %>% add_row(
  team_short_display_name="Saint Mary's",
  opponent_short_display_name="Indiana"
) %>% add_row(
  team_short_display_name="UCLA",
  opponent_short_display_name="Akron"
) %>% add_row(
  team_short_display_name="Texas",
  opponent_short_display_name="Virginia Tech"
) %>% add_row(
  team_short_display_name="Purdue",
  opponent_short_display_name="Yale"
) %>% add_row(
  team_short_display_name="Murray State",
  opponent_short_display_name="San Francisco"
) %>% add_row(
  team_short_display_name="Kentucky",
  opponent_short_display_name="Saint Peter's"
)

eastround1games <- cumulativesimplemodelgames %>% group_by(team_short_display_name) %>% filter(game_date == max(game_date) & season == 2022) %>% slice(1) %>% select(-TeamResult, -starts_with("opponent")) %>% right_join(eastround1games)

eastround1games <- cumulativesimplemodelgames %>% group_by(opponent_short_display_name) %>% filter(game_date == max(game_date) & season == 2022) %>% slice(1) %>% ungroup() %>% select(-TeamResult, -starts_with("team"), -game_id, -game_date, -season) %>% right_join(eastround1games) 

eastround1log <- log_fit %>% predict(new_data = eastround1games) %>%
  bind_cols(eastround1games) %>% select(.pred_class, team_short_display_name, opponent_short_display_name, everything())

eastround1log <- log_fit %>% predict(new_data = eastround1log, type="prob") %>%
  bind_cols(eastround1log) %>% select(.pred_class, .pred_W, .pred_L, team_short_display_name, opponent_short_display_name, everything())

eastround2games <- tibble(
  team_short_display_name="Baylor",
  opponent_short_display_name="North Carolina"
) %>% add_row(
  team_short_display_name="Indiana",
  opponent_short_display_name="UCLA"
)  %>% add_row(
  team_short_display_name="Texas",
  opponent_short_display_name="Purdue"
) %>% add_row(
  team_short_display_name="San Francisco",
  opponent_short_display_name="Kentucky"
) 

eastround2games <- cumulativesimplemodelgames %>% group_by(team_short_display_name) %>% filter(game_date == max(game_date) & season == 2022) %>% slice(1) %>% select(-TeamResult, -starts_with("opponent")) %>% right_join(eastround2games)

eastround2games <- cumulativesimplemodelgames %>% group_by(opponent_short_display_name) %>% filter(game_date == max(game_date) & season == 2022) %>% slice(1) %>% ungroup() %>% select(-TeamResult, -starts_with("team"), -game_id, -game_date, -season) %>% right_join(eastround2games) 

eastround2log <- log_fit %>% predict(new_data = eastround2games) %>%
  bind_cols(eastround2games) %>% select(.pred_class, team_short_display_name, opponent_short_display_name, everything())

eastround2log <- log_fit %>% predict(new_data = eastround2log, type="prob") %>%
  bind_cols(eastround2log) %>% select(.pred_class, .pred_W, .pred_L, team_short_display_name, opponent_short_display_name, everything())

eastround3games <- tibble(
  team_short_display_name="Baylor",
  opponent_short_display_name="UCLA"
) %>% add_row(
  team_short_display_name="Purdue",
  opponent_short_display_name="Kentucky"
) 

eastround3games <- cumulativesimplemodelgames %>% group_by(team_short_display_name) %>% filter(game_date == max(game_date) & season == 2022) %>% slice(1) %>% select(-TeamResult, -starts_with("opponent")) %>% right_join(eastround3games)

eastround3games <- cumulativesimplemodelgames %>% group_by(opponent_short_display_name) %>% filter(game_date == max(game_date) & season == 2022) %>% slice(1) %>% ungroup() %>% select(-TeamResult, -starts_with("team"), -game_id, -game_date, -season) %>% right_join(eastround3games) 

eastround3log <- log_fit %>% predict(new_data = eastround3games) %>%
  bind_cols(eastround3games) %>% select(.pred_class, team_short_display_name, opponent_short_display_name, everything())

eastround3log <- log_fit %>% predict(new_data = eastround3log, type="prob") %>%
  bind_cols(eastround3log) %>% select(.pred_class, .pred_W, .pred_L, team_short_display_name, opponent_short_display_name, everything())

eastround4games <- tibble(
  team_short_display_name="Baylor",
  opponent_short_display_name="Purdue"
) 

eastround4games <- cumulativesimplemodelgames %>% group_by(team_short_display_name) %>% filter(game_date == max(game_date) & season == 2022) %>% slice(1) %>% select(-TeamResult, -starts_with("opponent")) %>% right_join(eastround4games)

eastround4games <- cumulativesimplemodelgames %>% group_by(opponent_short_display_name) %>% filter(game_date == max(game_date) & season == 2022) %>% slice(1) %>% ungroup() %>% select(-TeamResult, -starts_with("team"), -game_id, -game_date, -season) %>% right_join(eastround4games) 

eastround4log <- log_fit %>% predict(new_data = eastround4games) %>%
  bind_cols(eastround4games) %>% select(.pred_class, team_short_display_name, opponent_short_display_name, everything())

eastround4log <- log_fit %>% predict(new_data = eastround4log, type="prob") %>%
  bind_cols(eastround4log) %>% select(.pred_class, .pred_W, .pred_L, team_short_display_name, opponent_short_display_name, everything())

southround1games <- tibble(
  team_short_display_name="Arizona",
  opponent_short_display_name="Wright State"
) %>% add_row(
  team_short_display_name="Seton Hall",
  opponent_short_display_name="TCU"
) %>% add_row(
  team_short_display_name="Houston",
  opponent_short_display_name="UAB"
) %>% add_row(
  team_short_display_name="Illinois",
  opponent_short_display_name="Chattanooga"
) %>% add_row(
  team_short_display_name="Colorado State",
  opponent_short_display_name="Michigan"
) %>% add_row(
  team_short_display_name="Tennessee",
  opponent_short_display_name="Longwood"
) %>% add_row(
  team_short_display_name="Ohio State",
  opponent_short_display_name="Loyola Chicago"
) %>% add_row(
  team_short_display_name="Villanova",
  opponent_short_display_name="Delaware"
)

southround1games <- cumulativesimplemodelgames %>% group_by(team_short_display_name) %>% filter(game_date == max(game_date) & season == 2022) %>% slice(1) %>% select(-TeamResult, -starts_with("opponent")) %>% right_join(southround1games)

southround1games <- cumulativesimplemodelgames %>% group_by(opponent_short_display_name) %>% filter(game_date == max(game_date) & season == 2022) %>% slice(1) %>% ungroup() %>% select(-TeamResult, -starts_with("team"), -game_id, -game_date, -season) %>% right_join(southround1games) 

southround1log <- log_fit %>% predict(new_data = southround1games) %>%
  bind_cols(southround1games) %>% select(.pred_class, team_short_display_name, opponent_short_display_name, everything())

southround1log <- log_fit %>% predict(new_data = southround1log, type="prob") %>%
  bind_cols(southround1log) %>% select(.pred_class, .pred_W, .pred_L, team_short_display_name, opponent_short_display_name, everything())

southround2games <- tibble(
  team_short_display_name="Arizona",
  opponent_short_display_name="Seton Hall"
) %>% add_row(
  team_short_display_name="Houston",
  opponent_short_display_name="Illinois"
) %>% add_row(
  team_short_display_name="Michigan",
  opponent_short_display_name="Tennessee"
) %>% add_row(
  team_short_display_name="Ohio State",
  opponent_short_display_name="Villanova"
)

southround2games <- cumulativesimplemodelgames %>% group_by(team_short_display_name) %>% filter(game_date == max(game_date) & season == 2022) %>% slice(1) %>% select(-TeamResult, -starts_with("opponent")) %>% right_join(southround2games)

southround2games <- cumulativesimplemodelgames %>% group_by(opponent_short_display_name) %>% filter(game_date == max(game_date) & season == 2022) %>% slice(1) %>% ungroup() %>% select(-TeamResult, -starts_with("team"), -game_id, -game_date, -season) %>% right_join(southround2games) 

southround2log <- log_fit %>% predict(new_data = southround2games) %>%
  bind_cols(southround2games) %>% select(.pred_class, team_short_display_name, opponent_short_display_name, everything())

southround2log <- log_fit %>% predict(new_data = southround2log, type="prob") %>%
  bind_cols(southround2log) %>% select(.pred_class, .pred_W, .pred_L, team_short_display_name, opponent_short_display_name, everything())

southround3games <- tibble(
  team_short_display_name="Arizona",
  opponent_short_display_name="Illinois"
) %>% add_row(
  team_short_display_name="Michigan",
  opponent_short_display_name="Villanova"
)

southround3games <- cumulativesimplemodelgames %>% group_by(team_short_display_name) %>% filter(game_date == max(game_date) & season == 2022) %>% slice(1) %>% select(-TeamResult, -starts_with("opponent")) %>% right_join(southround3games)

southround3games <- cumulativesimplemodelgames %>% group_by(opponent_short_display_name) %>% filter(game_date == max(game_date) & season == 2022) %>% slice(1) %>% ungroup() %>% select(-TeamResult, -starts_with("team"), -game_id, -game_date, -season) %>% right_join(southround3games) 

southround3log <- log_fit %>% predict(new_data = southround3games) %>%
  bind_cols(southround3games) %>% select(.pred_class, team_short_display_name, opponent_short_display_name, everything())

southround3log <- log_fit %>% predict(new_data = southround3log, type="prob") %>%
  bind_cols(southround3log) %>% select(.pred_class, .pred_W, .pred_L, team_short_display_name, opponent_short_display_name, everything())

southround4games <- tibble(
  team_short_display_name="Michigan",
  opponent_short_display_name="Illinois"
) 

southround4games <- cumulativesimplemodelgames %>% group_by(team_short_display_name) %>% filter(game_date == max(game_date) & season == 2022) %>% slice(1) %>% select(-TeamResult, -starts_with("opponent")) %>% right_join(southround4games)

southround4games <- cumulativesimplemodelgames %>% group_by(opponent_short_display_name) %>% filter(game_date == max(game_date) & season == 2022) %>% slice(1) %>% ungroup() %>% select(-TeamResult, -starts_with("team"), -game_id, -game_date, -season) %>% right_join(southround4games) 

southround4log <- log_fit %>% predict(new_data = southround4games) %>%
  bind_cols(southround4games) %>% select(.pred_class, team_short_display_name, opponent_short_display_name, everything())

southround4log <- log_fit %>% predict(new_data = southround4log, type="prob") %>%
  bind_cols(southround4log) %>% select(.pred_class, .pred_W, .pred_L, team_short_display_name, opponent_short_display_name, everything())

midwestround1games <- tibble(
  team_short_display_name="Kansas",
  opponent_short_display_name="Texas Southern"
) %>% add_row(
  team_short_display_name="San Diego State",
  opponent_short_display_name="Creighton"
) %>% add_row(
  team_short_display_name="Iowa",
  opponent_short_display_name="Richmond"
) %>% add_row(
  team_short_display_name="Providence",
  opponent_short_display_name="S Dakota St"
) %>% add_row(
  team_short_display_name="LSU",
  opponent_short_display_name="Iowa State"
) %>% add_row(
  team_short_display_name="Wisconsin",
  opponent_short_display_name="Colgate"
) %>% add_row(
  team_short_display_name="USC",
  opponent_short_display_name="Miami"
) %>% add_row(
  team_short_display_name="Auburn",
  opponent_short_display_name="J'Ville St"
)

midwestround1games <- cumulativesimplemodelgames %>% group_by(team_short_display_name) %>% filter(game_date == max(game_date) & season == 2022) %>% slice(1) %>% select(-TeamResult, -starts_with("opponent")) %>% right_join(midwestround1games)

midwestround1games <- cumulativesimplemodelgames %>% group_by(opponent_short_display_name) %>% filter(game_date == max(game_date) & season == 2022) %>% slice(1) %>% ungroup() %>% select(-TeamResult, -starts_with("team"), -game_id, -game_date, -season) %>% right_join(midwestround1games) 

midwestround1log <- log_fit %>% predict(new_data = midwestround1games) %>%
  bind_cols(midwestround1games) %>% select(.pred_class, team_short_display_name, opponent_short_display_name, everything())

midwestround1log <- log_fit %>% predict(new_data = midwestround1log, type="prob") %>%
  bind_cols(midwestround1log) %>% select(.pred_class, .pred_W, .pred_L, team_short_display_name, opponent_short_display_name, everything())

midwestround2games <- tibble(
  team_short_display_name="Kansas",
  opponent_short_display_name="Creighton"
) %>% add_row(
  team_short_display_name="Iowa",
  opponent_short_display_name="Providence"
) %>% add_row(
  team_short_display_name="LSU",
  opponent_short_display_name="Wisconsin"
) %>% add_row(
  team_short_display_name="USC",
  opponent_short_display_name="Auburn"
) 

midwestround2games <- cumulativesimplemodelgames %>% group_by(team_short_display_name) %>% filter(game_date == max(game_date) & season == 2022) %>% slice(1) %>% select(-TeamResult, -starts_with("opponent")) %>% right_join(midwestround2games)

midwestround2games <- cumulativesimplemodelgames %>% group_by(opponent_short_display_name) %>% filter(game_date == max(game_date) & season == 2022) %>% slice(1) %>% ungroup() %>% select(-TeamResult, -starts_with("team"), -game_id, -game_date, -season) %>% right_join(midwestround2games) 

midwestround2log <- log_fit %>% predict(new_data = midwestround2games) %>%
  bind_cols(midwestround2games) %>% select(.pred_class, team_short_display_name, opponent_short_display_name, everything())

midwestround2log <- log_fit %>% predict(new_data = midwestround2log, type="prob") %>%
  bind_cols(midwestround2log) %>% select(.pred_class, .pred_W, .pred_L, team_short_display_name, opponent_short_display_name, everything())

midwestround3games <- tibble(
  team_short_display_name="Creighton",
  opponent_short_display_name="Iowa"
) %>% add_row(
  team_short_display_name="Wisconsin",
  opponent_short_display_name="USC"
)

midwestround3games <- cumulativesimplemodelgames %>% group_by(team_short_display_name) %>% filter(game_date == max(game_date) & season == 2022) %>% slice(1) %>% select(-TeamResult, -starts_with("opponent")) %>% right_join(midwestround3games)

midwestround3games <- cumulativesimplemodelgames %>% group_by(opponent_short_display_name) %>% filter(game_date == max(game_date) & season == 2022) %>% slice(1) %>% ungroup() %>% select(-TeamResult, -starts_with("team"), -game_id, -game_date, -season) %>% right_join(midwestround3games) 

midwestround3log <- log_fit %>% predict(new_data = midwestround3games) %>%
  bind_cols(midwestround3games) %>% select(.pred_class, team_short_display_name, opponent_short_display_name, everything())

midwestround3log <- log_fit %>% predict(new_data = midwestround3log, type="prob") %>%
  bind_cols(midwestround3log) %>% select(.pred_class, .pred_W, .pred_L, team_short_display_name, opponent_short_display_name, everything())

midwestround4games <- tibble(
  team_short_display_name="Creighton",
  opponent_short_display_name="USC"
)

midwestround4games <- cumulativesimplemodelgames %>% group_by(team_short_display_name) %>% filter(game_date == max(game_date) & season == 2022) %>% slice(1) %>% select(-TeamResult, -starts_with("opponent")) %>% right_join(midwestround4games)

midwestround4games <- cumulativesimplemodelgames %>% group_by(opponent_short_display_name) %>% filter(game_date == max(game_date) & season == 2022) %>% slice(1) %>% ungroup() %>% select(-TeamResult, -starts_with("team"), -game_id, -game_date, -season) %>% right_join(midwestround4games) 

midwestround4log <- log_fit %>% predict(new_data = midwestround4games) %>%
  bind_cols(midwestround4games) %>% select(.pred_class, team_short_display_name, opponent_short_display_name, everything())

midwestround4log <- log_fit %>% predict(new_data = midwestround4log, type="prob") %>%
  bind_cols(midwestround4log) %>% select(.pred_class, .pred_W, .pred_L, team_short_display_name, opponent_short_display_name, everything())

finalfourgames <- tibble(
  team_short_display_name="Gonzaga",
  opponent_short_display_name="Baylor"
) %>% add_row(
  team_short_display_name="Illinois",
  opponent_short_display_name="USC"  
)

finalfourgames <- cumulativesimplemodelgames %>% group_by(team_short_display_name) %>% filter(game_date == max(game_date) & season == 2022) %>% slice(1) %>% select(-TeamResult, -starts_with("opponent")) %>% right_join(finalfourgames)

finalfourgames <- cumulativesimplemodelgames %>% group_by(opponent_short_display_name) %>% filter(game_date == max(game_date) & season == 2022) %>% slice(1) %>% ungroup() %>% select(-TeamResult, -starts_with("team"), -game_id, -game_date, -season) %>% right_join(finalfourgames) 

finalfourlog <- log_fit %>% predict(new_data = finalfourgames) %>%
  bind_cols(finalfourgames) %>% select(.pred_class, team_short_display_name, opponent_short_display_name, everything())

finalfourlog <- log_fit %>% predict(new_data = finalfourlog, type="prob") %>%
  bind_cols(finalfourlog) %>% select(.pred_class, .pred_W, .pred_L, team_short_display_name, opponent_short_display_name, everything())

champs <- tibble(
  team_short_display_name="Baylor",
  opponent_short_display_name="Illinois"
) 

champs <- cumulativesimplemodelgames %>% group_by(team_short_display_name) %>% filter(game_date == max(game_date) & season == 2022) %>% slice(1) %>% select(-TeamResult, -starts_with("opponent")) %>% right_join(champs)

champs <- cumulativesimplemodelgames %>% group_by(opponent_short_display_name) %>% filter(game_date == max(game_date) & season == 2022) %>% slice(1) %>% ungroup() %>% select(-TeamResult, -starts_with("team"), -game_id, -game_date, -season) %>% right_join(champs) 

champslog <- log_fit %>% predict(new_data = champs) %>%
  bind_cols(champs) %>% select(.pred_class, team_short_display_name, opponent_short_display_name, everything())

champslog <- log_fit %>% predict(new_data = champslog, type="prob") %>%
  bind_cols(champslog) %>% select(.pred_class, .pred_W, .pred_L, team_short_display_name, opponent_short_display_name, everything())

To make my predictors, I borrowed simple ratings and strength of schedule from Sports Reference, I calculated season long unweighted offensive and defensive efficiency margins, threw in KenPom’s luck metric and, at the last minute, decided to create a recency bias metric. What I did was compare the season-long efficiency margin of teams to the same metric from their last 10 games. If they were overplaying their season numbers, that gave them a positive measure. I was trying to capture teams who came into the tournament hot, or who were limping into the tournament based on their regular season resume but missing stars or just playing terrible.

Code
summary(cumulative_recipe) %>%
  select(variable, role) %>% 
  filter(role != "ID") %>%
  gt() %>%
  tab_header(
    title = "The predictors",
    subtitle = "A mix of efficiencies, ratings and recency bias."
  ) %>%  
  tab_source_note(
    source_note = md("**By:** Matt Waite")
  ) %>% 
  tab_style(
    style = cell_text(color = "black", weight = "bold", align = "left"),
    locations = cells_title("title")
  ) %>% 
  tab_style(
    style = cell_text(color = "black", align = "left"),
    locations = cells_title("subtitle")
  ) %>%
  tab_style(
     locations = cells_column_labels(columns = everything()),
     style = list(
       cell_borders(sides = "bottom", weight = px(3)),
       cell_text(weight = "bold", size=12)
     )
   ) %>%
  opt_row_striping() %>% 
  opt_table_lines("none")
The predictors
A mix of efficiencies, ratings and recency bias.
variable role
opponent_efficiency_margin predictor
team_efficiency_margin predictor
team_sos predictor
team_srs predictor
opponent_sos predictor
opponent_srs predictor
opponent_luck predictor
team_luck predictor
opponent_recency predictor
team_recency predictor
TeamResult outcome
By: Matt Waite

In the run up to the tournament, I used both a logistic regression and support vector machine algorithm, but got very similar results, so I stuck with the more simple logistic regression.

In testing, my model was calling college basketball games correctly about 74 percent of the time, so I knew I was going to need to get lucky on a few games. But isn’t that filling out a bracket?

Result? I did not get lucky.

Particularly in the Midwest Regional. My models labored to produce a Creighton vs USC Elite Eight match-up that sent USC to the Final Four. While Creighton nearly upset Kansas with half a lineup, USC bombed out in the first round. The team that beat them ended up going to the Elite Eight – Miami – taking out other teams I had predicted to win along the way.

Here’s what my model predicted in the Midwest Regional:

Code
midwestround1log %>% 
  select(team_short_display_name, .pred_class, .pred_W, opponent_short_display_name) %>%
  gt() %>% 
  cols_label(
    team_short_display_name = "Team",
    .pred_class = "Prediction",
    .pred_W = "Win Confidence",
    opponent_short_display_name = "Opponent"
  ) %>%
  tab_header(
    title = "Midwest Regional: Round 1",
    subtitle = "I was sure Iowa State would lose and USC was Final Four bound."
  ) %>%  
  tab_source_note(
    source_note = md("**By:** Matt Waite")
  ) %>% 
  tab_style(
    style = cell_text(color = "black", weight = "bold", align = "left"),
    locations = cells_title("title")
  ) %>% 
  tab_style(
    style = cell_text(color = "black", align = "left"),
    locations = cells_title("subtitle")
  ) %>%
  tab_style(
     locations = cells_column_labels(columns = everything()),
     style = list(
       cell_borders(sides = "bottom", weight = px(3)),
       cell_text(weight = "bold", size=12)
     )
   ) %>%
  opt_row_striping() %>% 
  opt_table_lines("none") %>%
    fmt_percent(
    columns = c(.pred_W),
    decimals = 1
  )
Midwest Regional: Round 1
I was sure Iowa State would lose and USC was Final Four bound.
Team Prediction Win Confidence Opponent
Wisconsin W 86.5% Colgate
San Diego State L 30.9% Creighton
LSU W 88.0% Iowa State
Auburn W 89.2% J'Ville St
USC W 92.9% Miami
Iowa W 82.4% Richmond
Providence W 82.4% S Dakota St
Kansas W 97.6% Texas Southern
By: Matt Waite

Normally I could survive an Iowa State and Miami win here … except I had both of their opponents moving on fairly deep.

Round two was a complete disaster.

Code
midwestround2log %>% 
  select(team_short_display_name, .pred_class, .pred_W, opponent_short_display_name) %>%
  gt() %>% 
  cols_label(
    team_short_display_name = "Team",
    .pred_class = "Prediction",
    .pred_W = "Win Confidence",
    opponent_short_display_name = "Opponent"
  ) %>%
  tab_header(
    title = "Midwest Regional: Round 2",
    subtitle = "Not one of these predictions were correct."
  ) %>%  
  tab_source_note(
    source_note = md("**By:** Matt Waite")
  ) %>% 
  tab_style(
    style = cell_text(color = "black", weight = "bold", align = "left"),
    locations = cells_title("title")
  ) %>% 
  tab_style(
    style = cell_text(color = "black", align = "left"),
    locations = cells_title("subtitle")
  ) %>%
  tab_style(
     locations = cells_column_labels(columns = everything()),
     style = list(
       cell_borders(sides = "bottom", weight = px(3)),
       cell_text(weight = "bold", size=12)
     )
   ) %>%
  opt_row_striping() %>% 
  opt_table_lines("none") %>%
    fmt_percent(
    columns = c(.pred_W),
    decimals = 1
  )
Midwest Regional: Round 2
Not one of these predictions were correct.
Team Prediction Win Confidence Opponent
USC W 88.8% Auburn
Kansas L 39.0% Creighton
Iowa W 63.0% Providence
LSU L 23.3% Wisconsin
By: Matt Waite

My models ended up with a Gonzaga v Baylor and Illinois v USC final four. That 0-4 on those. None of them made it. The Baylor repeat that I predicted died in the second round at the hands of North Carolina, the highest seeded team to make the Final Four.

Going into the Final Four, I’m in the 16th percentile of ESPN brackets, good enough for 14.6 millionth place. Last year, I was in the 38th percentile.

The best bracket in my class is in the 88th percentile, with another in the 86th. Only two students did worse than I did.

We are what they grow beyond.

Thinking about this bracket, I wanted to try rolling with something that didn’t just bite KenPom and make models out of his data. I wanted to see if I could get to a similar place without re-walking the same ground. I’ve got a year to work in this, but my energy is going to be focused on weighting competition and opponents throughout the season. If of two minds about this: As a Big Ten denizen, I have to wonder if beating up on each other for a whole season is why the Big Ten fades in the tournament. So I’m curious about a Fatigue Factor of some variety. At the same time, how does that explain St. Peter’s? Not sure it does, but I’m not sure there’s a model anywhere that’s going to.

The code I wrote to make this relied heavily on hoopR and tidymodels.