6 Aggregates

R is a statistical programming language that is purpose built for data analysis.

Base R does a lot, but there are a mountain of external libraries that do things to make R better/easier/more fully featured. We already installed the tidyverse – or you should have if you followed the instructions for the last assignment – which isn’t exactly a library, but a collection of libraries. Together, they make up the tidyverse. Individually, they are extraordinarily useful for what they do. We can load them all at once using the tidyverse name, or we can load them individually. Let’s start with individually.

The two libraries we are going to need for this assignment are readr and dplyr. The library readr reads different types of data in. For this assignment, we’re going to read in csv data or Comma Separated Values data. That’s data that has a comma between each column of data.

Then we’re going to use dplyr to analyze it.

To use a library, you need to import it. Good practice – one I’m going to insist on – is that you put all your library steps at the top of your notebooks.

That code looks like this:

library(readr)

To load them both, you need to do this:

library(readr)
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

But, because those two libraries – and several others that we’re going to use over the course of this class – are so commonly used, there’s a shortcut to loading all of the libraries we’ll need:

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

You can keep doing that for as many libraries as you need. I’ve seen notebooks with 10 or more library imports.

6.1 Importing data

The first thing we need to do is get some data to work with. We do that by reading it in. In our case, we’re going to read data from a csv file – a comma-separated values file.

The CSV file we’re going to read from is a Nebraska Game and Parks Commission dataset of confirmed mountain lion sightings in Nebraska. There are, on occasion, fierce debates about mountain lions and if they should be hunted in Nebraska. This dataset can tell us some interesting things about that debate.

So step 1 is to import the data. The code looks something like this, but hold off copying it just yet:

mountainlions <- read_csv("~/Documents/Data/mountainlions.csv")

Let’s unpack that.

The first part – mountainlions – is the name of your variable. A variable is just a name of a thing. In this case, our variable is a dataframe, which is R’s way of storing data. We can call this whatever we want. I always want to name dataframes after what is in it. In this case, we’re going to import a dataset of mountain lion sightings from the Nebraska Game and Parks Commission. Variable names, by convention are one word all lower case. You can end a variable with a number, but you can’t start one with a number.

The <- bit, you’ll recall from the basics, is the variable assignment operator. It’s how we know we’re assigning something to a word. Think of the arrow as saying “Take everything on the right of this arrow and stuff it into the thing on the left.” So we’re creating an empty vessel called mountainlions and stuffing all this data into it.

The read_csv bits are pretty obvious, except for one thing. What happens in the quote marks is the path to the data. In there, I have to tell R where it will find the data.

The easiest thing to do, if you are confused about how to find your data, is to put your data in the same folder as as your notebook (you’ll have to save that notebook first). If you do that, then you just need to put the name of the file in there (mountainlions.csv).

In my case, the file path I’ve got starts with a ~ character. That’s a shortcut for my home directory. It’s the same on your computer. Your home directory is where your Documents, Desktop and Downloads directories are. I’ve got a folder called Documents in my home directory, and in there is a folder called Data that has the file called mountainlions.csv in it. Thus, ~/Documents/Data/mountainlions.csv

Some people – insane people – leave the data in their downloads folder. The data path then would be ~/Downloads/nameofthedatafilehere.csv on PC or Mac.

A quick way to find your data file? The tab key. If you start your code dataframenamehere <- read_csv(") and after typing the first quote mark you hit tab, it will show you the files in the folder you are in. With that, you can start to narrow in on where you need to go.

So what you put in your import step will be different from mine. Your first task is to import the data. Here’s mine. Use the tab key to find your data file and get the correct path.

mountainlions <- read_csv("data/mountainlions.csv")

Rows: 393 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Cofirm Type, COUNTY, Date
dbl (1): ID

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Now we can inspect the data we imported. What does it look like? To do that, we use head(mountainlions) to show the headers and the first six rows of data. If we wanted to see them all, we could just simply enter mountainlions and run it.

To get the number of records in our dataset, we run nrow(mountainlions)

head(mountainlions)

# A tibble: 6 × 4
     ID `Cofirm Type` COUNTY       Date    
  <dbl> <chr>         <chr>        <chr>   
1     1 Track         Dawes        9/14/91 
2     2 Mortality     Sioux        11/10/91
3     3 Mortality     Scotts Bluff 4/21/96 
4     4 Mortality     Sioux        5/9/99  
5     5 Mortality     Box Butte    9/29/99 
6     6 Track         Scotts Bluff 11/12/99

nrow(mountainlions)

[1] 393

6.2 Group by and count

So what if we wanted to know how many mountain lion sightings there were in each county?

To do that by hand, we’d have to take each of the 393 records and sort them into a pile. We’d put them in groups and then count them.

dplyr has a group by function in it that does just this. A massive amount of data analysis involves grouping like things together and then doing simple things like counting them, or averaging them together. So it’s a good place to start.

So to do this, we’ll take our dataset and we’ll introduce a new operator: |>. The best way to read that operator, in my opinion, is to interpret that as “and then do this.”

We’re going to establish a pattern that will come up again and again throughout this book: data |> function. The first step of every analysis starts with the data being used. Then we apply functions to the data.

In our case, the pattern that you’ll use many, many times is: data |> group_by(FIELD NAME) |> summarize(VARIABLE NAME = AGGREGATE FUNCTION(FIELD NAME))

Here’s the code:

mountainlions |>
  group_by(COUNTY) |>
  summarise(
    total = n()
  )

# A tibble: 42 × 2
   COUNTY    total
   <chr>     <int>
 1 Banner        6
 2 Blaine        3
 3 Box Butte     4
 4 Brown        15
 5 Buffalo       3
 6 Cedar         1
 7 Cherry       30
 8 Custer        8
 9 Dakota        3
10 Dawes       111
# ℹ 32 more rows

So let’s walk through that. We start with our dataset – mountainlions – and then we tell it to group the data by a given field in the data. In this case, we wanted to group together all the counties, signified by the field name COUNTY, which you could get from looking at head(mountainlions). After we group the data, we need to count them up. In dplyr, we use summarize which can do more than just count things. Inside the parentheses in summarize, we set up the summaries we want. In this case, we just want a count of the counties: total = n(), says create a new field, called total and set it equal to n(), which might look weird, but it’s common in stats. The number of things in a dataset? Statisticians call in n. There are n number of incidents in this dataset. So n() is a function that counts the number of things there are.

And when we run that, we get a list of counties with a count next to them. But it’s not in any order. So we’ll add another And Then Do This |> and use arrange. Arrange does what you think it does – it arranges data in order. By default, it’s in ascending order – smallest to largest. But if we want to know the county with the most mountain lion sightings, we need to sort it in descending order. That looks like this:

mountainlions |>
  group_by(COUNTY) |>
  summarise(
    count = n()
  ) |> arrange(desc(count))

# A tibble: 42 × 2
   COUNTY       count
   <chr>        <int>
 1 Dawes          111
 2 Sioux           52
 3 Sheridan        35
 4 Cherry          30
 5 Scotts Bluff    26
 6 Keya Paha       20
 7 Brown           15
 8 Rock            11
 9 Lincoln         10
10 Custer           8
# ℹ 32 more rows

We can, if we want, group by more than one thing. So how are these sightings being confirmed? To do that, we can group by County and “Cofirm Type”, which is how the state misspelled Confirm. But note something in this example below:

mountainlions |>
  group_by(COUNTY, `Cofirm Type`) |>
  summarise(
    count = n()
  ) |> arrange(desc(count))

`summarise()` has grouped output by 'COUNTY'. You can override using the
`.groups` argument.

# A tibble: 93 × 3
# Groups:   COUNTY [42]
   COUNTY       `Cofirm Type`      count
   <chr>        <chr>              <int>
 1 Dawes        Trail Camera Photo    41
 2 Sioux        Trail Camera Photo    40
 3 Dawes        Track                 19
 4 Keya Paha    Trail Camera Photo    18
 5 Cherry       Trail Camera Photo    17
 6 Dawes        Mortality             17
 7 Sheridan     Trail Camera Photo    16
 8 Dawes        Photo                 13
 9 Dawes        DNA                   11
10 Scotts Bluff Trail Camera Photo    11
# ℹ 83 more rows

See it? When you have a field name that has two words, readr wraps it in back ticks, which is next to the 1 key on your keyboard. You can figure out which fields have back ticks around it by looking at the output of readr. Pay attention to that, because it’s coming up again in the next section and will be a part of your homework.

6.3 Other aggregates: Mean and median

In the last example, we grouped some data together and counted it up, but there’s so much more you can do. You can do multiple measures in a single step as well.

Let’s look at some salary data from the University of Nebraska.

salaries <- read_csv("data/nusalaries1819.csv")

Rows: 13039 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Employee, Position, Campus, Department
num (3): Budgeted Annual Salary, Salary from State Aided Funds, Salary from ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(salaries)

# A tibble: 6 × 7
  Employee                     Position Campus Department Budgeted Annual Sala…¹
  <chr>                        <chr>    <chr>  <chr>                       <dbl>
1 Abbey, Bryce M               Associa… UNK    Kinesiolo…                  61276
2 Abbott, Frances M            Staff S… UNL    FM&P Faci…                  37318
3 Abboud, Cheryl A             Adminis… UNMC   Surgery-U…                  76400
4 Abdalla, Maher Y             Asst Pr… UNMC   Pathology…                  74774
5 Abdelkarim, Ahmed Mohamed A… Post-Do… UNMC   Surgery-T…                  43516
6 Abdel-Monem, Tarik L         Researc… UNL    Public Po…                  58502
# ℹ abbreviated name: ¹`Budgeted Annual Salary`
# ℹ 2 more variables: `Salary from State Aided Funds` <dbl>,
#   `Salary from Other Funds` <dbl>

In summarize, we can calculate any number of measures. Here, we’ll use R’s built in mean and median functions to calculate … well, you get the idea.

salaries |>
  summarise(
    count = n(),
    mean_salary = mean(`Budgeted Annual Salary`),
    median_salary = median(`Budgeted Annual Salary`)
  )

# A tibble: 1 × 3
  count mean_salary median_salary
  <int>       <dbl>         <dbl>
1 13039      62065.         51343

So there’s 13,039 employees in the database, spread across four campuses plus the system office. The mean or average salary is about $62,000, but the median salary is slightly more than $51,000.

Why?

Let’s let sort help us.

salaries |> arrange(desc(`Budgeted Annual Salary`))

# A tibble: 13,039 × 7
   Employee             Position        Campus Department Budgeted Annual Sala…¹
   <chr>                <chr>           <chr>  <chr>                       <dbl>
 1 Frost, Scott A       Head Coach-Foo… UNL    Athletics                 5000000
 2 Miles, Timothy S     Head Coach-Bas… UNL    Athletics                 2375000
 3 Moos, William H      Athletic Direc… UNL    Athletics                 1000000
 4 Gold, Jeffrey P      Chancellor      UNMC   Office of…                 853338
 5 Chinander, Erik J    Assistant Coac… UNL    Athletics                  800000
 6 Walters, Troy M      Assistant Coac… UNL    Athletics                  700000
 7 Cook, John G         Head Coach-Vol… UNL    Athletics                  675000
 8 Williams, Amy M      Head Coach-Wom… UNL    Athletics                  626750
 9 Bounds, Hank M       President       UNCA   Office of…                 540000
10 Austin Jr, Gregory D Assistant Coac… UNL    Athletics                  475000
# ℹ 13,029 more rows
# ℹ abbreviated name: ¹`Budgeted Annual Salary`
# ℹ 2 more variables: `Salary from State Aided Funds` <dbl>,
#   `Salary from Other Funds` <dbl>

Oh, right. In this dataset, the university pays a football coach $5 million. Extremes influence averages, not medians, and now you have your answer.

So when choosing a measure of the middle, you have to ask yourself – could I have extremes? Because a median won’t be sensitive to extremes. It will be the point at which half the numbers are above and half are below. The average or mean will be a measure of the middle, but if you have a bunch of low paid people and then one football coach, the average will be wildly skewed. Here, because there’s so few highly paid football coaches compared to people who make a normal salary, the number is only slightly skewed in the grand scheme, but skewed nonetheless.

6.4 Even more aggregates

There’s a ton of things we can do in summarize – we’ll work with more of them as the course progresses – but here’s a few other questions you can ask.

Which department on campus has the highest wage bill? And what is the highest and lowest salary in the department? And how wide is the spread between salaries? We can find that with sum to add up the salaries to get the total wage bill, min to find the minumum salary, max to find the maximum salary and sd to find the standard deviation in the numbers.

salaries |> 
  group_by(Campus, Department) |> 
  summarize(
    total = sum(`Budgeted Annual Salary`), 
    avgsalary = mean(`Budgeted Annual Salary`), 
    minsalary = min(`Budgeted Annual Salary`),
    maxsalary = max(`Budgeted Annual Salary`),
    stdev = sd(`Budgeted Annual Salary`)) |> arrange(desc(total))

`summarise()` has grouped output by 'Campus'. You can override using the
`.groups` argument.

# A tibble: 804 × 7
# Groups:   Campus [5]
   Campus Department                  total avgsalary minsalary maxsalary  stdev
   <chr>  <chr>                       <dbl>     <dbl>     <dbl>     <dbl>  <dbl>
 1 UNL    Athletics                  3.56e7   118508.     12925   5000000 3.33e5
 2 UNMC   Pathology/Microbiology     1.36e7    63158.      1994    186925 3.41e4
 3 UNL    Agronomy & Horticulture    8.98e6    66496.      5000    208156 4.01e4
 4 UNMC   Anesthesiology             7.90e6    78237.     10000    245174 3.59e4
 5 UNL    School of Natural Resourc… 6.86e6    65995.      2400    194254 3.28e4
 6 UNL    College of Law             6.70e6    77953.      1000    326400 7.23e4
 7 UNL    University Television      6.44e6    55542.     16500    221954 2.75e4
 8 UNL    University Libraries       6.27e6    51390.      1200    215917 2.68e4
 9 UNMC   Pharmacology/Exp Neurosci… 6.24e6    58911.      2118    248139 4.29e4
10 UNMC   CON-Omaha Division         6.11e6    78304.      3000    172522 4.48e4
# ℹ 794 more rows

So again, no surprise, the UNL athletic department has the single largest wage bill at nearly $36 million. The average salary in the department is $118,508 – more than double the univeristy as a whole, again thanks to Scott Frost’s paycheck.