library(tidyverse)
library(cfbfastR)
39 Using packages to get data
There is a growing number of packages and repositories of sports data, largely because there’s a growing number of people who want to analyze that data. We’ve gotten data ourselves with simple Google Sheets tricks. Then there’s rvest, a package for scraping the data yourself from websites. But with these packages, someone has done the work of gathering the data for you. All you have to learn are the commands to get it.
One very promising collection of libraries is something called the SportsDataverse, which has a collection of packages covering specific sports, all of which are in various stages of development. Some are more complete than others, but they are all being actively worked on by developers. Packages of interest in this class are:
- cfbfastR, for college football.
- hoopR, for men’s professional and college basketball.
- wehoop, for women’s professional and college basketball.
- baseballr, for professional and college baseball.
- worldfootballR, for soccer data from around the world.
- hockeyR, for NHL hockey data.
- recruitR, for college sports recruiting.
Not part of the SportsDataverse, but in the same neighborhood, is nflfastR, which can provide NFL play-by-play data.
Because they’re all under development, not all of them can be installed with just a simple install.packages("something"). Some require a little work, some require API keys.
The main issue for you is to read the documentation carefully.
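Before installing anything, it can help to see what is already on your machine. The following is a minimal sketch using only base R; the package list is just the ones this chapter uses, and the commented-out GitHub install line is an assumption about where a development version might live, so check each package's documentation for the real instructions.

```r
# Packages this chapter uses; edit the list to match what you need.
pkgs <- c("cfbfastR", "wehoop", "progressr")

# requireNamespace() returns TRUE/FALSE without attaching the package.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Names of anything that still needs installing.
missing_pkgs <- names(installed)[!installed]

if (length(missing_pkgs) > 0) {
  message("Not installed yet: ", paste(missing_pkgs, collapse = ", "))
  # CRAN packages install the usual way:
  # install.packages(missing_pkgs)
  # Development versions often come from GitHub instead (assumed repo name;
  # confirm it in the package documentation before running):
  # remotes::install_github("sportsdataverse/cfbfastR")
}
```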
39.1 Using cfbfastR as a cautionary tale
cfbfastR presents us a good view into the promise and peril of libraries like this.
First, to make this work, follow the installation instructions and then follow how to get an API key from College Football Data and how to add that to your environment. But maybe wait to do that until you read the whole section.
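Once you have a key, a quick check like the one below can confirm that R can see it. This is a sketch that assumes the key is stored in an environment variable named CFBD_API_KEY, which is the name I understand cfbfastR to look for; verify that against the installation instructions.

```r
# Assumption: the key lives in an environment variable called CFBD_API_KEY.
# To set it for the current session only, you could run:
# Sys.setenv(CFBD_API_KEY = "YOUR-API-KEY-HERE")

key <- Sys.getenv("CFBD_API_KEY")

# nzchar() is TRUE when the string is not empty.
has_key <- nzchar(key)

if (has_key) {
  message("API key found.")
} else {
  message("No API key found. Add it to your environment and restart R.")
}
```

Setting the key in your .Renviron file instead of typing it each session keeps it out of your scripts.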
After installations, we can load it up.
You might be thinking, “Oh wow, I can get play-by-play data for college football. Let’s look at the five most heartbreaking plays of this doomed Nebraska season.” After all, what better way to measure doom than the steepest drop in win probability, which is included in the data?
Great idea. Let’s do it.
The first thing to do is read the documentation. You’ll see that you can request data for each week. For example, here’s week 2, which is actually Nebraska’s third game (the week 0 game is lumped in with week 1).
nebraska <- cfbd_pbp_data(
  2021,
  week = 2,
  season_type = "regular",
  team = "Nebraska",
  epa_wpa = TRUE
)
• 09:46:54 | Start processing of 1 game...
There’s no easy way to get all of a single team’s games at once. One approach that isn’t pretty, but works, looks like this:
wk1 <- cfbd_pbp_data(2021, week = 1, season_type = "regular", team = "Nebraska", epa_wpa = TRUE)
Sys.sleep(2)
wk2 <- cfbd_pbp_data(2021, week = 2, season_type = "regular", team = "Nebraska", epa_wpa = TRUE)
Sys.sleep(2)
wk3 <- cfbd_pbp_data(2021, week = 3, season_type = "regular", team = "Nebraska", epa_wpa = TRUE)
Sys.sleep(2)
wk4 <- cfbd_pbp_data(2021, week = 4, season_type = "regular", team = "Nebraska", epa_wpa = TRUE)
Sys.sleep(2)
wk5 <- cfbd_pbp_data(2021, week = 5, season_type = "regular", team = "Nebraska", epa_wpa = TRUE)
Sys.sleep(2)
wk6 <- cfbd_pbp_data(2021, week = 6, season_type = "regular", team = "Nebraska", epa_wpa = TRUE)
Sys.sleep(2)
wk7 <- cfbd_pbp_data(2021, week = 7, season_type = "regular", team = "Nebraska", epa_wpa = TRUE)
Sys.sleep(2)
wk9 <- cfbd_pbp_data(2021, week = 9, season_type = "regular", team = "Nebraska", epa_wpa = TRUE)
Sys.sleep(2)
wk10 <- cfbd_pbp_data(2021, week = 10, season_type = "regular", team = "Nebraska", epa_wpa = TRUE)

nuplays <- bind_rows(wk1, wk2, wk3, wk4, wk5, wk6, wk7, wk9, wk10)
The Sys.sleep calls just pause for two seconds before running the next request. Since we’re requesting data from someone else’s computer, we want to be kind. Week 8 was a bye week for Nebraska, so if you request it, you’ll get an empty result and a warning. The bind_rows call puts all the dataframes into a single dataframe.
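The same request, pause, and bind pattern can be written as a loop so the call is not repeated nine times. What follows is a sketch, not the chapter's method: fetch_weeks and fake_fetch are hypothetical names invented here, and the stand-in fetcher exists only so the example runs without a network connection or API key.

```r
library(dplyr)

# Hypothetical helper: call fetch_one(week) for each week, pause between
# requests to be kind to the server, then stack the results.
fetch_weeks <- function(weeks, fetch_one, pause = 2) {
  results <- vector("list", length(weeks))
  for (i in seq_along(weeks)) {
    results[[i]] <- fetch_one(weeks[i])
    # No need to wait after the final request.
    if (i < length(weeks)) Sys.sleep(pause)
  }
  bind_rows(results)
}

# Stand-in fetcher returning made-up rows, so the sketch runs offline.
fake_fetch <- function(week) {
  tibble(week = week, play_type = c("Rush", "Pass"))
}

# Weeks 1 through 10, skipping the week 8 bye.
demo <- fetch_weeks(c(1:7, 9:10), fake_fetch, pause = 0)

# With cfbfastR loaded and an API key set, the real fetcher would wrap the
# call used above:
# nuplays <- fetch_weeks(c(1:7, 9:10), function(w) {
#   cfbd_pbp_data(2021, week = w, season_type = "regular",
#                 team = "Nebraska", epa_wpa = TRUE)
# })
```

Passing the fetcher in as an argument keeps the pause-and-bind logic separate from any one package, so the same helper could wrap a different data function.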
Now you’re ready to look at heartbreak. How do we define heartbreak? How about like this: you first have to lose the game, it comes in the third or fourth quarter, it involves a play (i.e. not a timeout), and it results in the biggest drop in win probability.
nuplays |>
  filter(
    pos_team == "Nebraska" &
      def_pos_team != "Fordham" &
      def_pos_team != "Buffalo" &
      def_pos_team != "Northwestern" &
      play_type != "Timeout"
  ) |>
  filter(period == 3 | period == 4) |>
  mutate(HeartbreakLevel = wp_before - wp_after) |>
  arrange(desc(HeartbreakLevel)) |>
  top_n(5, wt = HeartbreakLevel) |>
  select(period, clock.minutes, def_pos_team, play_type, play_text)
── play-by-play data from CollegeFootballData.com ──────────── cfbfastR 1.9.0 ──
ℹ Data updated: 2023-12-27 09:46:57 CST
# A tibble: 5 × 5
period clock.minutes def_pos_team play_type play_text
<int> <int> <chr> <chr> <chr>
1 4 1 Michigan Fumble Recovery (Opponent) Adrian Martine…
2 4 1 Michigan State Punt William Przyst…
3 4 14 Michigan State Sack Adrian Martine…
4 3 10 Purdue Interception Return Adrian Martine…
5 4 3 Michigan State Punt Return Touchdown Daniel Cerni p…
The most heartbreaking play of the season? A fourth quarter fumble against Michigan. Next up: Basically the entire fourth quarter against Michigan State.
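To see what the heartbreak calculation does without requesting any data, here is a minimal sketch on a handful of made-up plays. The column names match the ones used above, but every value is invented for illustration. It also swaps top_n for slice_max, which dplyr now recommends in its place; that is a substitution, not what the chapter's code uses.

```r
library(dplyr)

# Invented plays; win probabilities are not real data.
toy_plays <- tibble(
  period    = c(1, 3, 4, 4),
  play_type = c("Rush", "Interception Return", "Timeout", "Punt"),
  wp_before = c(0.50, 0.60, 0.45, 0.40),
  wp_after  = c(0.55, 0.35, 0.45, 0.30)
)

heartbreak <- toy_plays |>
  # Second half only, and only actual plays.
  filter(period %in% c(3, 4), play_type != "Timeout") |>
  # A positive value means win probability fell on the play.
  mutate(HeartbreakLevel = wp_before - wp_after) |>
  # Keep the single largest drop.
  slice_max(HeartbreakLevel, n = 1)

heartbreak
```

The first-quarter rush and the timeout are filtered out, which leaves the interception return, with its drop from 0.60 to 0.35, as the most heartbreaking play.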
39.2 Another example
The wehoop package is mature enough to have a version on CRAN, so you can install it the usual way with install.packages("wehoop"). Another helpful library to install is progressr, with install.packages("progressr").
library(wehoop)
Many of these libraries have more than play-by-play data. For example, wehoop has box scores and player data for both the WNBA and college basketball. From personal experience, WNBA data isn’t hard to get, but women’s college basketball is a giant pain.
So, who is Nebraska’s single season points champion over the last five seasons?
progressr::with_progress({
  wbb_player_box <- wehoop::load_wbb_player_box(2017:2021)
})
With progressr, you’ll see a progress bar in the console, which lets you know that your command is still working, since some of these requests take minutes to complete. Player box scores are quicker: five seasons took a matter of seconds.
If you look at the wbb_player_box data we now have, we have each player in each game over each season – more than 300,000 records. Finding out who Nebraska’s top 10 single-season scoring leaders are is a matter of grouping, summing and filtering.
wbb_player_box |>
  filter(team_short_display_name == "Nebraska") |>
  group_by(athlete_display_name, season) |>
  summarise(totalPoints = sum(as.numeric(points))) |>
  arrange(desc(totalPoints)) |>
  ungroup() |>
  top_n(10, wt = totalPoints)
# A tibble: 10 × 3
athlete_display_name season totalPoints
<chr> <int> <dbl>
1 Jessica Shepard 2017 538
2 Sam Haiby 2021 438
3 Leigha Brown 2020 433
4 Hannah Whitish 2018 403
5 Kate Cain 2018 316
6 Sam Haiby 2020 300
7 Hannah Whitish 2019 292
8 Sam Haiby 2019 285
9 Leigha Brown 2019 280
10 Kate Cain 2020 278
This just in: Sam Haiby is good at basketball.
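The grouping, summing and filtering steps can be tried on a small made-up box score first. This sketch assumes, based on the as.numeric call above, that points may arrive as text rather than numbers; the players and values are invented, and slice_max stands in for arrange plus top_n.

```r
library(dplyr)

# Invented box score rows: one row per player per game.
toy_box <- tibble(
  athlete_display_name = c("Player A", "Player A", "Player B", "Player A", "Player B"),
  season = c(2020, 2020, 2020, 2021, 2021),
  points = c("10", "12", "30", "25", "8") # stored as text on purpose
)

season_totals <- toy_box |>
  group_by(athlete_display_name, season) |>
  # Convert to numbers before summing; sum() fails on character data.
  summarise(totalPoints = sum(as.numeric(points)), .groups = "drop") |>
  # Keep the two highest player-seasons, largest first.
  slice_max(totalPoints, n = 2)

season_totals
```

Setting .groups = "drop" in summarise removes the grouping in the same step, so a separate ungroup() is not needed.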