4  Aggregates

R is a statistical programming language that is purpose built for data analysis.

Base R does a lot, but there is a mountain of external libraries that make R better, easier and more fully featured. We already installed the tidyverse – or you should have if you followed the instructions for the last assignment – which isn’t exactly a library, but a collection of libraries. Together, they make up the tidyverse. Individually, they are extraordinarily useful for what they do. We can load them all at once using the tidyverse name, or we can load them individually. Let’s start with individually.

The two libraries we are going to need for this assignment are readr and dplyr. The library readr reads different types of data in as a dataframe. For this assignment, we’re going to read in csv data or Comma Separated Values data. That’s data that has a comma between each column of data.

Then we’re going to use dplyr to analyze it.

To use a library, you need to import it. Good practice – one I’m going to insist on – is that you put all your library steps at the top of your notebook.

That code looks like this:

library(readr)

To load them both, you need to run that code twice:

library(readr)
library(dplyr)

You can keep doing that for as many libraries as you need. I’ve seen notebooks with 10 or more library imports.

But the tidyverse has a neat little trick. We can load most of the libraries we’ll need for the whole semester with one line:

library(tidyverse)

From now on, if that’s not the first line of your notebook, you’re probably doing it wrong.
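If a library line fails with an error saying there is no package with that name, the package probably isn't installed yet. Here is a minimal sketch of a check, assuming you only care about the three packages named above. The variable names needed, installed and missing_pkgs are made up for illustration.

```r
# Packages this chapter mentions (names taken from the text above).
needed <- c("readr", "dplyr", "tidyverse")

# requireNamespace() returns TRUE if a package can be loaded, FALSE if not.
# It does not attach the package; you still need library() for that.
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# A named TRUE/FALSE vector, one entry per package.
print(installed)

# Anything listed here needs install.packages("name") before library() works.
missing_pkgs <- needed[!installed]
print(missing_pkgs)
```

If missing_pkgs comes back empty, every library line in this chapter should run.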

4.1 Basic data analysis: Group By and Count

The first thing we need to do is get some data to work with. We do that by reading it in. In our case, we’re going to read data from a csv file – a comma-separated values file.

The CSV file we’re going to read from is a Basketball Reference page of advanced metrics for NBA players this season. The Sports Reference sites are a godsend of data, a trove of stuff, and we’re going to use it a lot in this class.


So step 2, after setting up our libraries, is most often going to be importing data. In order to analyze data, we need data, so it stands to reason that this would be something we’d do very early.

The code looks something like this, but hold off copying it just yet:

nbaplayers <- read_csv("~/Box/SportsData/nbaadvancedplayers1920.csv")

Let’s unpack that.

The first part – nbaplayers – is the name of your variable. A variable is just a name for a thing that stores stuff. In this case, our variable is a data frame, which is R’s way of storing data (technically it’s a tibble, which is the tidyverse way of storing data, but the differences aren’t important and people use them interchangeably). We can call this whatever we want. I always want to name data frames after what is in them. In this case, we’re going to import a dataset of NBA players. Variable names, by convention, are one word, all lower case. You can end a variable name with a number, but you can’t start one with a number.

The <- bit is the variable assignment operator. It’s how we know we’re assigning something to a word. Think of the arrow as saying “Take everything on the right of this arrow and stuff it into the thing on the left.” So we’re creating an empty vessel called nbaplayers and stuffing all this data into it.

The read_csv bits are pretty obvious, except for one thing. What happens in the quote marks is the path to the data. In there, I have to tell R where it will find the data. The easiest thing to do, if you are confused about how to find your data, is to put your data in the same folder as your notebook (you’ll have to save that notebook first). If you do that, then you just need to put the name of the file in there (nbaadvancedplayers1920.csv). In my case, I’ve got a folder called Box in my home directory (that’s the ~ part), and in there is a folder called SportsData that has the file called nbaadvancedplayers1920.csv in it. Some people – insane people – leave the data in their downloads folder. The data path then would be ~/Downloads/nameofthedatafilehere.csv on PC or Mac.

What you put in there will be different from mine. So your first task is to import the data.

nbaplayers <- read_csv("data/nbaadvancedplayers1920.csv")
Rows: 651 Columns: 27
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): Player, Pos, Tm
dbl (24): Rk, Age, G, MP, PER, TS%, 3PAr, FTr, ORB%, DRB%, TRB%, AST%, STL%,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
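If the path part is still fuzzy, here is a small sketch you can run anywhere. It does not use the real NBA file: it writes a tiny made-up CSV to a temporary location, checks that R can see it, and reads it back. The names csv_path and toy and the three fake players are assumptions for illustration only.

```r
library(readr)

# Write a tiny, made-up CSV to a temporary file so there is something to read.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Player,Pos,MP",
  "Player A,SG,2556",
  "Player B,C,1680",
  "Player C,SG,1"
), csv_path)

# Confirm R can find the file at that path before trying to read it.
print(file.exists(csv_path))

# show_col_types = FALSE quiets the column specification message.
toy <- read_csv(csv_path, show_col_types = FALSE)
print(toy)
```

If file.exists() says FALSE for your own path, read_csv will fail too, so that is a quick first thing to check.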

Now we can inspect the data we imported. What does it look like? To do that, we use head(nbaplayers) to show the headers and the first six rows of data. If we wanted to see them all, we could simply enter nbaplayers and run it.

To get the number of records in our dataset, we run nrow(nbaplayers)

head(nbaplayers)
# A tibble: 6 × 27
     Rk Player     Pos     Age Tm        G    MP   PER `TS%` `3PAr`   FTr `ORB%`
  <dbl> <chr>      <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>  <dbl>
1     1 Steven Ad… C        26 OKC      63  1680  20.5 0.604  0.006 0.421   14  
2     2 Bam Adeba… PF       22 MIA      72  2417  20.3 0.598  0.018 0.484    8.5
3     3 LaMarcus … C        34 SAS      53  1754  19.7 0.571  0.198 0.241    6.3
4     4 Kyle Alex… PF       23 MIA       2    13   4.7 0.5    0     0       17.9
5     5 Nickeil A… SG       21 NOP      47   591   8.9 0.473  0.5   0.139    1.6
6     6 Grayson A… SG       24 MEM      38   718  12   0.609  0.562 0.179    1.2
# ℹ 15 more variables: `DRB%` <dbl>, `TRB%` <dbl>, `AST%` <dbl>, `STL%` <dbl>,
#   `BLK%` <dbl>, `TOV%` <dbl>, `USG%` <dbl>, OWS <dbl>, DWS <dbl>, WS <dbl>,
#   `WS/48` <dbl>, OBPM <dbl>, DBPM <dbl>, BPM <dbl>, VORP <dbl>
nrow(nbaplayers)
[1] 651

Another way to look at nrow – we have 651 players from this season in our dataset.
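Here is a small sketch of a few ways to size up a data frame, using a made-up four-row tibble instead of the real file. The name toy and the players in it are assumptions for illustration. The functions head, nrow and ncol are base R, and glimpse comes with dplyr.

```r
library(dplyr)

# A tiny stand-in for nbaplayers, with invented rows.
toy <- tibble(
  Player = c("Player A", "Player B", "Player C", "Player D"),
  Pos = c("SG", "C", "SG", "PF"),
  MP = c(2556, 1680, 1, 790)
)

print(head(toy, n = 2))  # first two rows instead of the default six
print(nrow(toy))         # number of rows (records)
print(ncol(toy))         # number of columns
glimpse(toy)             # one line per column: name, type, first values
```

With 27 columns in the real data, glimpse can be easier to read than head because nothing gets cut off on the right.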

What if we wanted to know how many players there were by position? To do that by hand, we’d have to take each of the 651 records and sort them into a pile. We’d put them in groups and then count them.

dplyr has a group by function in it that does just this. A massive amount of data analysis involves grouping like things together at some point. So it’s a good place to start.

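So to do this, we’ll take our dataset and we’ll introduce a new operator: |>. This is R’s pipe. In older tidyverse code you’ll see it written as %>%, which works the same way for what we’re doing here. The best way to read that operator, in my opinion, is to interpret that as “and then do this.”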

After we group them together, we need to count them. We do that first by saying we want to summarize our data (a count is a part of a summary). To get a summary, we have to tell it what we want. So in this case, we want a count. To get that, let’s create a thing called total and set it equal to n(), which is dplyr’s way of counting something.

Here’s the code:

nbaplayers |>
  group_by(Pos) |>
  summarise(
    total = n()
  )
# A tibble: 9 × 2
  Pos   total
  <chr> <int>
1 C       111
2 C-PF      2
3 PF      135
4 PF-C      2
5 PG      111
6 SF      113
7 SF-PF     4
8 SF-SG     3
9 SG      170

So let’s walk through that. We start with our dataset – nbaplayers – and then we tell it to group the data by a given field in the data. You can find the field names by looking at the output of head, or you can look in the environment where you’ll see nbaplayers.

In this case, we wanted to group together positions, signified by the field name Pos. After we group the data, we need to count them up. In dplyr, we use summarize (the code above spells it summarise; dplyr accepts both), which can do more than just count things. Inside the parentheses in summarize, we set up the summaries we want. In this case, we just want a count of the positions: total = n() says create a new field called total and set it equal to n(), which might look weird, but it’s common in stats. The number of things in a dataset? Statisticians call it n. There are n number of players in this dataset. So n() is a function that counts the number of things there are.
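To see the group-and-count pattern without the full dataset, here is a minimal sketch on five made-up players. The names toy, by_hand and shortcut are assumptions for illustration. It also shows count, a dplyr shortcut that does the group_by and summarise steps in one call.

```r
library(dplyr)

# Five invented players: three shooting guards, one center, one power forward.
toy <- tibble(
  Player = c("A", "B", "C", "D", "E"),
  Pos = c("SG", "C", "SG", "PF", "SG")
)

# The long way: group, then summarise with n().
by_hand <- toy |>
  group_by(Pos) |>
  summarise(total = n())

# The shortcut: count() groups and counts in one step.
# name = "total" sets the name of the count column (the default is n).
shortcut <- toy |>
  count(Pos, name = "total")

print(by_hand)
print(shortcut)
```

Both give one row per position with the same totals. The long way is worth learning first because summarise can hold other measures besides a count.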

And when we run that, we get a list of positions with a count next to them. But it’s not in any useful order. So we’ll add another And Then Do This |> and use arrange. Arrange does what you think it does – it arranges data in order. By default, it’s in ascending order – smallest to largest. But if we want to know the position with the most players, we need to sort it in descending order. That looks like this:

nbaplayers |>
  group_by(Pos) |>
  summarise(
    total = n()
  ) |> arrange(desc(total))
# A tibble: 9 × 2
  Pos   total
  <chr> <int>
1 SG      170
2 PF      135
3 SF      113
4 C       111
5 PG      111
6 SF-PF     4
7 SF-SG     3
8 C-PF      2
9 PF-C      2

So the most common position in the NBA? Shooting guard, followed by power forward.
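Notice that centers and point guards are tied at 111. Here is a small sketch of how arrange handles ascending order, descending order and ties, using three of the counts from the output above. The names counts, ascending and descending are assumptions for illustration.

```r
library(dplyr)

# Three of the position counts from the output above.
counts <- tibble(
  Pos = c("PG", "SG", "C"),
  total = c(111, 170, 111)
)

# Default: ascending, smallest to largest.
ascending <- counts |>
  arrange(total)

# Descending by total, then alphabetical by Pos to break the tie.
descending <- counts |>
  arrange(desc(total), Pos)

print(ascending)
print(descending)
```

Adding a second field to arrange is an easy way to make tied rows come out in a predictable order.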

We can, if we want, group by more than one thing. Which team has the most of a single position? To do that, we can group by the team – called Tm in the data – and position, or Pos in the data:

nbaplayers |>
  group_by(Tm, Pos) |>
  summarise(
    total = n()
  ) |> arrange(desc(total))
`summarise()` has grouped output by 'Tm'. You can override using the `.groups`
argument.
# A tibble: 159 × 3
# Groups:   Tm [31]
   Tm    Pos   total
   <chr> <chr> <int>
 1 TOT   PF       13
 2 TOT   SG       13
 3 SAC   PF        9
 4 TOT   SF        9
 5 BRK   SG        8
 6 LAL   SG        8
 7 TOT   PG        8
 8 ATL   SG        7
 9 BRK   SF        7
10 DAL   SG        7
# ℹ 149 more rows

So wait, what team is TOT?

Valuable lesson: whoever collects the data has opinions on how to solve problems. In this case, Basketball Reference, when a player gets traded, records stats for the player’s first team, their second team, and a combined season total for a team called TOT, meaning Total. Is there a team abbreviated TOT? No. So ignore them here.

Sacramento has 9 power forwards. Brooklyn has 8 shooting guards, as do the Lakers. You can learn a bit about how a team is assembled by looking at these simple counts.
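Rather than ignoring TOT by eye, you can drop those rows before counting. Here is a minimal sketch on made-up data, assuming a traded player shows up three times the way Basketball Reference records it. It uses dplyr's filter, which keeps only the rows that pass a test, and the .groups argument that the message in the output above mentions. The names toy and team_counts are assumptions for illustration.

```r
library(dplyr)

# Invented rows: player A was traded, so A appears as TOT, SAC and BRK.
toy <- tibble(
  Player = c("A", "A", "A", "B", "C", "D"),
  Tm = c("TOT", "SAC", "BRK", "SAC", "SAC", "BRK"),
  Pos = c("PF", "PF", "PF", "PF", "SG", "SG")
)

team_counts <- toy |>
  filter(Tm != "TOT") |>                      # keep rows where team is not TOT
  group_by(Tm, Pos) |>
  summarise(total = n(), .groups = "drop") |> # drop grouping, quiet the message
  arrange(desc(total))

print(team_counts)
```

Sacramento's two power forwards land at the top, and no TOT rows remain.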

4.2 Other aggregates: Mean and median

In the last example, we grouped some data together and counted it up, but there’s so much more you can do. You can do multiple measures in a single step as well.

Sticking with our NBA player data, we can calculate any number of measures inside summarize. Here, we’ll use R’s built in mean and median functions to calculate … well, you get the idea.

Let’s look just at the number of minutes each position gets.

nbaplayers |>
  group_by(Pos) |>
  summarise(
    count = n(),
    mean_minutes = mean(MP),
    median_minutes = median(MP)
  )
# A tibble: 9 × 4
  Pos   count mean_minutes median_minutes
  <chr> <int>        <dbl>          <dbl>
1 C       111         891.           887 
2 C-PF      2         316.           316.
3 PF      135         790.           567 
4 PF-C      2        1548.          1548.
5 PG      111         944.           850 
6 SF      113         877.           754 
7 SF-PF     4         638.           286.
8 SF-SG     3        1211           1688 
9 SG      170         843.           654.

So there are 651 players in the data. Let’s look at shooting guards. The average shooting guard plays about 843 minutes and the median is 653.5 minutes.

Why?

Let’s let sort help us.

nbaplayers |> arrange(desc(MP))
# A tibble: 651 × 27
      Rk Player    Pos     Age Tm        G    MP   PER `TS%` `3PAr`   FTr `ORB%`
   <dbl> <chr>     <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>  <dbl>
 1   323 CJ McCol… SG       28 POR      70  2556  17   0.541  0.378 0.136    1.9
 2    55 Devin Bo… SG       23 PHO      70  2512  20.6 0.618  0.31  0.397    1.3
 3   198 James Ha… SG       30 HOU      68  2483  29.1 0.626  0.557 0.528    2.9
 4    27 Harrison… PF       27 SAC      72  2482  13.3 0.574  0.338 0.337    3.4
 5   297 Damian L… PG       29 POR      66  2474  26.9 0.627  0.5   0.384    1.4
 6   204 Tobias H… PF       27 PHI      72  2469  17.2 0.556  0.304 0.184    3.1
 7   479 P.J. Tuc… PF       34 HOU      72  2467   8.3 0.559  0.702 0.113    4.7
 8   175 Shai Gil… SG       21 OKC      70  2428  17.7 0.568  0.247 0.352    2.2
 9     2 Bam Adeb… PF       22 MIA      72  2417  20.3 0.598  0.018 0.484    8.5
10   343 Donovan … SG       23 UTA      69  2364  18.8 0.558  0.352 0.24     2.6
# ℹ 641 more rows
# ℹ 15 more variables: `DRB%` <dbl>, `TRB%` <dbl>, `AST%` <dbl>, `STL%` <dbl>,
#   `BLK%` <dbl>, `TOV%` <dbl>, `USG%` <dbl>, OWS <dbl>, DWS <dbl>, WS <dbl>,
#   `WS/48` <dbl>, OBPM <dbl>, DBPM <dbl>, BPM <dbl>, VORP <dbl>

The player with the most minutes on the floor is a shooting guard. Shooting guard is also the most common position, so at one end there’s CJ McCollum rolling up 2,556 minutes in a season, and at the other there’s Cleveland Cavaliers sensation J.P. Macura. Never heard of J.P. Macura? Might be because he logged one minute in one game this season.

That’s a huge difference.

So when choosing a measure of the middle, you have to ask yourself – could I have extremes? Because a median won’t be sensitive to extremes. It will be the point at which half the numbers are above and half are below. The average or mean will be a measure of the middle, but if you have a bunch of pine riders and then one ironman superstar, the average will be wildly skewed.
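You can see the effect with five numbers. This is a minimal sketch using a made-up vector called minutes: four bench players with tiny totals, plus one ironman at 2,556. The values other than 1 and 2,556 are invented for illustration.

```r
# Four bench players and one ironman (invented values, except 1 and 2556).
minutes <- c(1, 5, 8, 10, 2556)

# The mean gets dragged toward the one huge number.
print(mean(minutes))    # 516

# The median is just the middle value once the numbers are sorted.
print(median(minutes))  # 8

# Drop the ironman and the mean falls back in line with the rest.
print(mean(minutes[-5]))  # 6
```

Nobody in that group played anything close to 516 minutes, which is the problem with a mean when you have extremes.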

4.3 Even more aggregates

There’s a ton of things we can do in summarize – we’ll work with more of them as the course progresses – but here’s a few other questions you can ask.

Which position in the NBA plays the most minutes? And what is the highest and lowest minute total for that position? And how wide is the spread between minutes? We can find that with sum to add up the minutes to get the total minutes, min to find the minimum minutes, max to find the maximum minutes and sd to find the standard deviation in the numbers.

nbaplayers |> 
  group_by(Pos) |> 
  summarise(
    total = sum(MP), 
    avgminutes = mean(MP), 
    minminutes = min(MP),
    maxminutes = max(MP),
    stdev = sd(MP)) |> arrange(desc(total))
# A tibble: 9 × 6
  Pos    total avgminutes minminutes maxminutes stdev
  <chr>  <dbl>      <dbl>      <dbl>      <dbl> <dbl>
1 SG    143229       843.          1       2556 735. 
2 PF    106654       790.          5       2482 719. 
3 PG    104745       944.          8       2474 727. 
4 SF     99109       877.         11       2316 709. 
5 C      98914       891.          3       2336 619. 
6 SF-SG   3633      1211          87       1858 977. 
7 PF-C    3097      1548.        960       2137 832. 
8 SF-PF   2553       638.         46       1936 873. 
9 C-PF     633       316.        256        377  85.6

So again, no surprise, shooting guards spend the most minutes on the floor in the NBA. They average about 843 minutes, but we noted why that’s trouble. The minimum is the J.P. Macura Award, max is the Trail Blazers failing at load management, and the standard deviation is a measure of how spread out the data is. In this case, not the highest spread among positions, but pretty high. So you know you’ve got some huge minutes players and a bunch of bench players.
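To get a feel for what the standard deviation is telling you, here is a small sketch with two made-up positions: one where everyone plays about the same amount and one with a bench player and an ironman. The names toy and spread and all the minute totals are assumptions for illustration. One thing worth knowing: R's sd calculates the sample standard deviation, which divides by n - 1.

```r
library(dplyr)

# Invented minutes: centers are bunched together, shooting guards are not.
toy <- tibble(
  Pos = c("SG", "SG", "SG", "C", "C", "C"),
  MP = c(1, 800, 2556, 850, 900, 950)
)

spread <- toy |>
  group_by(Pos) |>
  summarise(
    total = sum(MP),       # all minutes added up
    minminutes = min(MP),  # smallest value in the group
    maxminutes = max(MP),  # largest value in the group
    stdev = sd(MP)         # sample standard deviation
  ) |>
  arrange(desc(total))

print(spread)
```

The centers come out with a standard deviation of 50, while the shooting guards are well over 1,000. Same kind of summary, very different story about how the minutes are spread around.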