One of the most common data analysis techniques is to look at change over time. The most common way of comparing change over time is through percent change. The math behind calculating percent change is very simple, and you should know it off the top of your head. The easy way to remember it is:
(new - old) / old
Or new minus old divided by old. Your new number minus the old number, the result of which is divided by the old number. To do that in R, we can use dplyr and mutate to calculate new metrics in a new field using existing fields of data.
So first we’ll import the tidyverse so we can read in our data and begin to work with it.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.2 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Now we’ll import a common and simple dataset of county population estimates from the US Census Bureau. Each year, the Census Bureau publishes estimates for states and counties. This one has every county in the US. A common question: who are the winners and losers?
population <-read_csv('data/countypopulations.csv')
Rows: 3142 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): STNAME, CTYNAME
dbl (11): CENSUS2010POP, ESTIMATESBASE2010, POPESTIMATE2010, POPESTIMATE2011...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The code to calculate percent change is pretty simple. Remember, with summarize, we used n() to count things. With mutate, we use very similar syntax to calculate a new value – a new column of data – using other values in our dataset. So in this case, we’re trying to do (new-old)/old, but we’re doing it with fields.
If we look at what we got when we imported the data, you’ll see there’s POPESTIMATE2018 as the new data, and we’ll use POPESTIMATE2017 as the old data. So we’re looking at one year. Then, to help us, we’ll use arrange again to sort it, so we get the fastest growing county over one year.
population |>mutate(change = (POPESTIMATE2018 - POPESTIMATE2017)/POPESTIMATE2017)
# A tibble: 3,142 × 14
STNAME CTYNAME CENSUS2010POP ESTIMATESBASE2010 POPESTIMATE2010
<chr> <chr> <dbl> <dbl> <dbl>
1 Alabama Autauga County 54571 54574 54754
2 Alabama Baldwin County 182265 182264 183111
3 Alabama Barbour County 27457 27457 27330
4 Alabama Bibb County 22915 22920 22872
5 Alabama Blount County 57322 57321 57373
6 Alabama Bullock County 10914 10911 10878
7 Alabama Butler County 20947 20943 20942
8 Alabama Calhoun County 118572 118594 118477
9 Alabama Chambers County 34215 34171 34122
10 Alabama Cherokee County 25989 25989 25974
# ℹ 3,132 more rows
# ℹ 9 more variables: POPESTIMATE2011 <dbl>, POPESTIMATE2012 <dbl>,
# POPESTIMATE2013 <dbl>, POPESTIMATE2014 <dbl>, POPESTIMATE2015 <dbl>,
# POPESTIMATE2016 <dbl>, POPESTIMATE2017 <dbl>, POPESTIMATE2018 <dbl>,
# change <dbl>
Click the black arrow pointing right and you’ll see, way out on the right, your change column. But what do you see right away? Do those numbers look like we expect them to? No. They’re a decimal expressed as a percentage. So let’s fix that by multiplying by 100.
population |>mutate(change = ((POPESTIMATE2018 - POPESTIMATE2017)/POPESTIMATE2017)*100)
# A tibble: 3,142 × 14
STNAME CTYNAME CENSUS2010POP ESTIMATESBASE2010 POPESTIMATE2010
<chr> <chr> <dbl> <dbl> <dbl>
1 Alabama Autauga County 54571 54574 54754
2 Alabama Baldwin County 182265 182264 183111
3 Alabama Barbour County 27457 27457 27330
4 Alabama Bibb County 22915 22920 22872
5 Alabama Blount County 57322 57321 57373
6 Alabama Bullock County 10914 10911 10878
7 Alabama Butler County 20947 20943 20942
8 Alabama Calhoun County 118572 118594 118477
9 Alabama Chambers County 34215 34171 34122
10 Alabama Cherokee County 25989 25989 25974
# ℹ 3,132 more rows
# ℹ 9 more variables: POPESTIMATE2011 <dbl>, POPESTIMATE2012 <dbl>,
# POPESTIMATE2013 <dbl>, POPESTIMATE2014 <dbl>, POPESTIMATE2015 <dbl>,
# POPESTIMATE2016 <dbl>, POPESTIMATE2017 <dbl>, POPESTIMATE2018 <dbl>,
# change <dbl>
Now, does this ordering do anything for us? No. Let’s fix that with arrange.
population |>mutate(change = ((POPESTIMATE2018 - POPESTIMATE2017)/POPESTIMATE2017)*100) |>arrange(desc(change))
# A tibble: 3,142 × 14
STNAME CTYNAME CENSUS2010POP ESTIMATESBASE2010 POPESTIMATE2010
<chr> <chr> <dbl> <dbl> <dbl>
1 Texas Loving County 82 82 84
2 Colorado San Juan Coun… 699 699 708
3 North Dakota McKenzie Coun… 6360 6359 6411
4 Kentucky Lee County 7887 7887 7718
5 North Dakota Williams Coun… 22398 22399 22588
6 Texas Comal County 108472 108485 109270
7 Texas Kenedy County 416 413 417
8 Texas Kaufman County 103350 103363 103890
9 North Carolina Brunswick Cou… 107431 107429 108065
10 Florida Walton County 55043 55043 55211
# ℹ 3,132 more rows
# ℹ 9 more variables: POPESTIMATE2011 <dbl>, POPESTIMATE2012 <dbl>,
# POPESTIMATE2013 <dbl>, POPESTIMATE2014 <dbl>, POPESTIMATE2015 <dbl>,
# POPESTIMATE2016 <dbl>, POPESTIMATE2017 <dbl>, POPESTIMATE2018 <dbl>,
# change <dbl>
So who had the most growth last year from the year before? Is everyone moving to Loving County, Texas? Or is it small changes in a small county? Also, note North Dakota showing up twice in the top 10.
7.1 Another use of mutate
Note in our data we have separate State and County name fields. If we were publishing this, we wouldn’t want that.
So how can we fix that? Mutate! And a new function to combine text together called paste. Paste allows us to merge fields together easily with a separator. In our case, we want to combine the county name and the state name with a comma and a space between them.