library(readr)
6 Aggregates
R is a statistical programming language that is purpose built for data analysis.
Base R does a lot, but there are a mountain of external libraries that do things to make R better/easier/more fully featured. We already installed the tidyverse – or you should have if you followed the instructions for the last assignment – which isn’t exactly a library, but a collection of libraries. Together, they make up the tidyverse. Individually, they are extraordinarily useful for what they do. We can load them all at once using the tidyverse name, or we can load them individually. Let’s start with individually.
The two libraries we are going to need for this assignment are readr
and dplyr
. The library readr
reads different types of data in. For this assignment, we’re going to read in csv data or Comma Separated Values data. That’s data that has a comma between each column of data.
Then we’re going to use dplyr
to analyze it.
To use a library, you need to import it. Good practice – one I’m going to insist on – is that you put all your library steps at the top of your notebooks.
That code looks like this:
To load them both, you need to do this:
library(readr)
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
But, because those two libraries – and several others that we’re going to use over the course of this class – are so commonly used, there’s a shortcut to loading all of the libraries we’ll need:
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
You can keep doing that for as many libraries as you need. I’ve seen notebooks with 10 or more library imports.
6.1 Importing data
The first thing we need to do is get some data to work with. We do that by reading it in. In our case, we’re going to read data from a csv file – a comma-separated values file.
The CSV file we’re going to read from is a Nebraska Game and Parks Commission dataset of confirmed mountain lion sightings in Nebraska. There are, on occasion, fierce debates about mountain lions and if they should be hunted in Nebraska. This dataset can tell us some interesting things about that debate.
So step 1 is to import the data. The code looks something like this, but hold off copying it just yet:
mountainlions <- read_csv("~/Documents/Data/mountainlions.csv")
Let’s unpack that.
The first part – mountainlions – is the name of your variable. A variable is just a name of a thing. In this case, our variable is a dataframe, which is R’s way of storing data. We can call this whatever we want. I always want to name dataframes after what is in it. In this case, we’re going to import a dataset of mountain lion sightings from the Nebraska Game and Parks Commission. Variable names, by convention are one word all lower case. You can end a variable with a number, but you can’t start one with a number.
The <- bit, you’ll recall from the basics, is the variable assignment operator. It’s how we know we’re assigning something to a word. Think of the arrow as saying “Take everything on the right of this arrow and stuff it into the thing on the left.” So we’re creating an empty vessel called mountainlions and stuffing all this data into it.
The read_csv
bits are pretty obvious, except for one thing. What happens in the quote marks is the path to the data. In there, I have to tell R where it will find the data.
The easiest thing to do, if you are confused about how to find your data, is to put your data in the same folder as as your notebook (you’ll have to save that notebook first). If you do that, then you just need to put the name of the file in there (mountainlions.csv).
In my case, the file path I’ve got starts with a ~ character. That’s a shortcut for my home directory. It’s the same on your computer. Your home directory is where your Documents, Desktop and Downloads directories are. I’ve got a folder called Documents in my home directory, and in there is a folder called Data that has the file called mountainlions.csv in it. Thus, ~/Documents/Data/mountainlions.csv
Some people – insane people – leave the data in their downloads folder. The data path then would be ~/Downloads/nameofthedatafilehere.csv
on PC or Mac.
A quick way to find your data file? The tab key. If you start your code dataframenamehere <- read_csv(")
and after typing the first quote mark you hit tab, it will show you the files in the folder you are in. With that, you can start to narrow in on where you need to go.
So what you put in your import step will be different from mine. Your first task is to import the data. Here’s mine. Use the tab key to find your data file and get the correct path.
<- read_csv("data/mountainlions.csv") mountainlions
Rows: 393 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Cofirm Type, COUNTY, Date
dbl (1): ID
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Now we can inspect the data we imported. What does it look like? To do that, we use head(mountainlions)
to show the headers and the first six rows of data. If we wanted to see them all, we could just simply enter mountainlions
and run it.
To get the number of records in our dataset, we run nrow(mountainlions)
head(mountainlions)
# A tibble: 6 × 4
ID `Cofirm Type` COUNTY Date
<dbl> <chr> <chr> <chr>
1 1 Track Dawes 9/14/91
2 2 Mortality Sioux 11/10/91
3 3 Mortality Scotts Bluff 4/21/96
4 4 Mortality Sioux 5/9/99
5 5 Mortality Box Butte 9/29/99
6 6 Track Scotts Bluff 11/12/99
nrow(mountainlions)
[1] 393
6.2 Group by and count
So what if we wanted to know how many mountain lion sightings there were in each county?
To do that by hand, we’d have to take each of the 393 records and sort them into a pile. We’d put them in groups and then count them.
dplyr
has a group by function in it that does just this. A massive amount of data analysis involves grouping like things together and then doing simple things like counting them, or averaging them together. So it’s a good place to start.
So to do this, we’ll take our dataset and we’ll introduce a new operator: |>
. The best way to read that operator, in my opinion, is to interpret that as “and then do this.”
We’re going to establish a pattern that will come up again and again throughout this book: data |> function
. The first step of every analysis starts with the data being used. Then we apply functions to the data.
In our case, the pattern that you’ll use many, many times is: data |> group_by(FIELD NAME) |> summarize(VARIABLE NAME = AGGREGATE FUNCTION(FIELD NAME))
Here’s the code:
|>
mountainlions group_by(COUNTY) |>
summarise(
total = n()
)
# A tibble: 42 × 2
COUNTY total
<chr> <int>
1 Banner 6
2 Blaine 3
3 Box Butte 4
4 Brown 15
5 Buffalo 3
6 Cedar 1
7 Cherry 30
8 Custer 8
9 Dakota 3
10 Dawes 111
# ℹ 32 more rows
So let’s walk through that. We start with our dataset – mountainlions
– and then we tell it to group the data by a given field in the data. In this case, we wanted to group together all the counties, signified by the field name COUNTY, which you could get from looking at head(mountainlions)
. After we group the data, we need to count them up. In dplyr, we use summarize
which can do more than just count things. Inside the parentheses in summarize, we set up the summaries we want. In this case, we just want a count of the counties: total = n(),
says create a new field, called total
and set it equal to n()
, which might look weird, but it’s common in stats. The number of things in a dataset? Statisticians call in n. There are n number of incidents in this dataset. So n()
is a function that counts the number of things there are.
And when we run that, we get a list of counties with a count next to them. But it’s not in any order. So we’ll add another And Then Do This |> and use arrange
. Arrange does what you think it does – it arranges data in order. By default, it’s in ascending order – smallest to largest. But if we want to know the county with the most mountain lion sightings, we need to sort it in descending order. That looks like this:
|>
mountainlions group_by(COUNTY) |>
summarise(
count = n()
|> arrange(desc(count)) )
# A tibble: 42 × 2
COUNTY count
<chr> <int>
1 Dawes 111
2 Sioux 52
3 Sheridan 35
4 Cherry 30
5 Scotts Bluff 26
6 Keya Paha 20
7 Brown 15
8 Rock 11
9 Lincoln 10
10 Custer 8
# ℹ 32 more rows
We can, if we want, group by more than one thing. So how are these sightings being confirmed? To do that, we can group by County and “Cofirm Type”, which is how the state misspelled Confirm. But note something in this example below:
|>
mountainlions group_by(COUNTY, `Cofirm Type`) |>
summarise(
count = n()
|> arrange(desc(count)) )
`summarise()` has grouped output by 'COUNTY'. You can override using the
`.groups` argument.
# A tibble: 93 × 3
# Groups: COUNTY [42]
COUNTY `Cofirm Type` count
<chr> <chr> <int>
1 Dawes Trail Camera Photo 41
2 Sioux Trail Camera Photo 40
3 Dawes Track 19
4 Keya Paha Trail Camera Photo 18
5 Cherry Trail Camera Photo 17
6 Dawes Mortality 17
7 Sheridan Trail Camera Photo 16
8 Dawes Photo 13
9 Dawes DNA 11
10 Scotts Bluff Trail Camera Photo 11
# ℹ 83 more rows
See it? When you have a field name that has two words, readr
wraps it in back ticks, which is next to the 1 key on your keyboard. You can figure out which fields have back ticks around it by looking at the output of readr
. Pay attention to that, because it’s coming up again in the next section and will be a part of your homework.
6.3 Other aggregates: Mean and median
In the last example, we grouped some data together and counted it up, but there’s so much more you can do. You can do multiple measures in a single step as well.
Let’s look at some salary data from the University of Nebraska.
<- read_csv("data/nusalaries1819.csv") salaries
Rows: 13039 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Employee, Position, Campus, Department
num (3): Budgeted Annual Salary, Salary from State Aided Funds, Salary from ...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(salaries)
# A tibble: 6 × 7
Employee Position Campus Department Budgeted Annual Sala…¹
<chr> <chr> <chr> <chr> <dbl>
1 Abbey, Bryce M Associa… UNK Kinesiolo… 61276
2 Abbott, Frances M Staff S… UNL FM&P Faci… 37318
3 Abboud, Cheryl A Adminis… UNMC Surgery-U… 76400
4 Abdalla, Maher Y Asst Pr… UNMC Pathology… 74774
5 Abdelkarim, Ahmed Mohamed A… Post-Do… UNMC Surgery-T… 43516
6 Abdel-Monem, Tarik L Researc… UNL Public Po… 58502
# ℹ abbreviated name: ¹`Budgeted Annual Salary`
# ℹ 2 more variables: `Salary from State Aided Funds` <dbl>,
# `Salary from Other Funds` <dbl>
In summarize, we can calculate any number of measures. Here, we’ll use R’s built in mean and median functions to calculate … well, you get the idea.
|>
salaries summarise(
count = n(),
mean_salary = mean(`Budgeted Annual Salary`),
median_salary = median(`Budgeted Annual Salary`)
)
# A tibble: 1 × 3
count mean_salary median_salary
<int> <dbl> <dbl>
1 13039 62065. 51343
So there’s 13,039 employees in the database, spread across four campuses plus the system office. The mean or average salary is about $62,000, but the median salary is slightly more than $51,000.
Why?
Let’s let sort help us.
|> arrange(desc(`Budgeted Annual Salary`)) salaries
# A tibble: 13,039 × 7
Employee Position Campus Department Budgeted Annual Sala…¹
<chr> <chr> <chr> <chr> <dbl>
1 Frost, Scott A Head Coach-Foo… UNL Athletics 5000000
2 Miles, Timothy S Head Coach-Bas… UNL Athletics 2375000
3 Moos, William H Athletic Direc… UNL Athletics 1000000
4 Gold, Jeffrey P Chancellor UNMC Office of… 853338
5 Chinander, Erik J Assistant Coac… UNL Athletics 800000
6 Walters, Troy M Assistant Coac… UNL Athletics 700000
7 Cook, John G Head Coach-Vol… UNL Athletics 675000
8 Williams, Amy M Head Coach-Wom… UNL Athletics 626750
9 Bounds, Hank M President UNCA Office of… 540000
10 Austin Jr, Gregory D Assistant Coac… UNL Athletics 475000
# ℹ 13,029 more rows
# ℹ abbreviated name: ¹`Budgeted Annual Salary`
# ℹ 2 more variables: `Salary from State Aided Funds` <dbl>,
# `Salary from Other Funds` <dbl>
Oh, right. In this dataset, the university pays a football coach $5 million. Extremes influence averages, not medians, and now you have your answer.
So when choosing a measure of the middle, you have to ask yourself – could I have extremes? Because a median won’t be sensitive to extremes. It will be the point at which half the numbers are above and half are below. The average or mean will be a measure of the middle, but if you have a bunch of low paid people and then one football coach, the average will be wildly skewed. Here, because there’s so few highly paid football coaches compared to people who make a normal salary, the number is only slightly skewed in the grand scheme, but skewed nonetheless.
6.4 Even more aggregates
There’s a ton of things we can do in summarize – we’ll work with more of them as the course progresses – but here’s a few other questions you can ask.
Which department on campus has the highest wage bill? And what is the highest and lowest salary in the department? And how wide is the spread between salaries? We can find that with sum
to add up the salaries to get the total wage bill, min
to find the minumum salary, max
to find the maximum salary and sd
to find the standard deviation in the numbers.
|>
salaries group_by(Campus, Department) |>
summarize(
total = sum(`Budgeted Annual Salary`),
avgsalary = mean(`Budgeted Annual Salary`),
minsalary = min(`Budgeted Annual Salary`),
maxsalary = max(`Budgeted Annual Salary`),
stdev = sd(`Budgeted Annual Salary`)) |> arrange(desc(total))
`summarise()` has grouped output by 'Campus'. You can override using the
`.groups` argument.
# A tibble: 804 × 7
# Groups: Campus [5]
Campus Department total avgsalary minsalary maxsalary stdev
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 UNL Athletics 3.56e7 118508. 12925 5000000 3.33e5
2 UNMC Pathology/Microbiology 1.36e7 63158. 1994 186925 3.41e4
3 UNL Agronomy & Horticulture 8.98e6 66496. 5000 208156 4.01e4
4 UNMC Anesthesiology 7.90e6 78237. 10000 245174 3.59e4
5 UNL School of Natural Resourc… 6.86e6 65995. 2400 194254 3.28e4
6 UNL College of Law 6.70e6 77953. 1000 326400 7.23e4
7 UNL University Television 6.44e6 55542. 16500 221954 2.75e4
8 UNL University Libraries 6.27e6 51390. 1200 215917 2.68e4
9 UNMC Pharmacology/Exp Neurosci… 6.24e6 58911. 2118 248139 4.29e4
10 UNMC CON-Omaha Division 6.11e6 78304. 3000 172522 4.48e4
# ℹ 794 more rows
So again, no surprise, the UNL athletic department has the single largest wage bill at nearly $36 million. The average salary in the department is $118,508 – more than double the univeristy as a whole, again thanks to Scott Frost’s paycheck.