7  Mutating data

One of the most common data analysis techniques is to look at change over time. The most common way of comparing change over time is through percent change. The math behind calculating percent change is very simple, and you should know it off the top of your head. The easy way to remember it is:

(new - old) / old

Or new minus old divided by old. Your new number minus the old number, the result of which is divided by the old number. To do that in R, we can use dplyr and mutate to calculate new metrics in a new field using existing fields of data.

So first we’ll import the tidyverse so we can read in our data and begin to work with it.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Now we’ll import a common and simple dataset of county population estimates from the US Census Bureau. Each year, the Census Bureau publishes estimates for states and counties. This one has every county in the US. A common question: who are the winners and losers?

population <- read_csv('data/countypopulations.csv')
Rows: 3142 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): STNAME, CTYNAME
dbl (11): CENSUS2010POP, ESTIMATESBASE2010, POPESTIMATE2010, POPESTIMATE2011...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The code to calculate percent change is pretty simple. Remember, with summarize, we used n() to count things. With mutate, we use very similar syntax to calculate a new value – a new column of data – using other values in our dataset. So in this case, we’re trying to do (new-old)/old, but we’re doing it with fields.

If we look at what we got when we imported the data, you’ll see there’s POPESTIMATE2018 as the new data, and we’ll use POPESTIMATE2017 as the old data. So we’re looking at one year. Then, to help us, we’ll use arrange again to sort it, so we get the fastest growing county over one year.

population |> mutate(
  change = (POPESTIMATE2018 - POPESTIMATE2017)/POPESTIMATE2017
) 
# A tibble: 3,142 × 14
   STNAME  CTYNAME         CENSUS2010POP ESTIMATESBASE2010 POPESTIMATE2010
   <chr>   <chr>                   <dbl>             <dbl>           <dbl>
 1 Alabama Autauga County          54571             54574           54754
 2 Alabama Baldwin County         182265            182264          183111
 3 Alabama Barbour County          27457             27457           27330
 4 Alabama Bibb County             22915             22920           22872
 5 Alabama Blount County           57322             57321           57373
 6 Alabama Bullock County          10914             10911           10878
 7 Alabama Butler County           20947             20943           20942
 8 Alabama Calhoun County         118572            118594          118477
 9 Alabama Chambers County         34215             34171           34122
10 Alabama Cherokee County         25989             25989           25974
# ℹ 3,132 more rows
# ℹ 9 more variables: POPESTIMATE2011 <dbl>, POPESTIMATE2012 <dbl>,
#   POPESTIMATE2013 <dbl>, POPESTIMATE2014 <dbl>, POPESTIMATE2015 <dbl>,
#   POPESTIMATE2016 <dbl>, POPESTIMATE2017 <dbl>, POPESTIMATE2018 <dbl>,
#   change <dbl>

Click the black arrow pointing right and you’ll see, way out on the right, your change column. But what do you see right away? Do those numbers look like we expect them to? No. They’re a decimal expressed as a percentage. So let’s fix that by multiplying by 100.

population |> mutate(
  change = ((POPESTIMATE2018 - POPESTIMATE2017)/POPESTIMATE2017)*100
) 
# A tibble: 3,142 × 14
   STNAME  CTYNAME         CENSUS2010POP ESTIMATESBASE2010 POPESTIMATE2010
   <chr>   <chr>                   <dbl>             <dbl>           <dbl>
 1 Alabama Autauga County          54571             54574           54754
 2 Alabama Baldwin County         182265            182264          183111
 3 Alabama Barbour County          27457             27457           27330
 4 Alabama Bibb County             22915             22920           22872
 5 Alabama Blount County           57322             57321           57373
 6 Alabama Bullock County          10914             10911           10878
 7 Alabama Butler County           20947             20943           20942
 8 Alabama Calhoun County         118572            118594          118477
 9 Alabama Chambers County         34215             34171           34122
10 Alabama Cherokee County         25989             25989           25974
# ℹ 3,132 more rows
# ℹ 9 more variables: POPESTIMATE2011 <dbl>, POPESTIMATE2012 <dbl>,
#   POPESTIMATE2013 <dbl>, POPESTIMATE2014 <dbl>, POPESTIMATE2015 <dbl>,
#   POPESTIMATE2016 <dbl>, POPESTIMATE2017 <dbl>, POPESTIMATE2018 <dbl>,
#   change <dbl>

Now, does this ordering do anything for us? No. Let’s fix that with arrange.

population |> mutate(
  change = ((POPESTIMATE2018 - POPESTIMATE2017)/POPESTIMATE2017)*100
)  |> arrange(desc(change))
# A tibble: 3,142 × 14
   STNAME         CTYNAME        CENSUS2010POP ESTIMATESBASE2010 POPESTIMATE2010
   <chr>          <chr>                  <dbl>             <dbl>           <dbl>
 1 Texas          Loving County             82                82              84
 2 Colorado       San Juan Coun…           699               699             708
 3 North Dakota   McKenzie Coun…          6360              6359            6411
 4 Kentucky       Lee County              7887              7887            7718
 5 North Dakota   Williams Coun…         22398             22399           22588
 6 Texas          Comal County          108472            108485          109270
 7 Texas          Kenedy County            416               413             417
 8 Texas          Kaufman County        103350            103363          103890
 9 North Carolina Brunswick Cou…        107431            107429          108065
10 Florida        Walton County          55043             55043           55211
# ℹ 3,132 more rows
# ℹ 9 more variables: POPESTIMATE2011 <dbl>, POPESTIMATE2012 <dbl>,
#   POPESTIMATE2013 <dbl>, POPESTIMATE2014 <dbl>, POPESTIMATE2015 <dbl>,
#   POPESTIMATE2016 <dbl>, POPESTIMATE2017 <dbl>, POPESTIMATE2018 <dbl>,
#   change <dbl>

So who had the most growth last year from the year before? Is everyone moving to Loving County, Texas? Or is it small changes in a small county? Also, note North Dakota showing up twice in the top 10.

7.1 Another use of mutate

Note in our data we have separate State and County name fields. If we were publishing this, we wouldn’t want that.

So how can we fix that? Mutate! And a new function to combine text together called paste. Paste allows us to merge fields together easily with a separator. In our case, we want to combine the county name and the state name with a comma and a space between them.

population |> 
  mutate(
    change = ((POPESTIMATE2018 - POPESTIMATE2017)/POPESTIMATE2017)*100,
    location = paste(CTYNAME, STNAME, sep=", ")) |> 
  arrange(desc(change))
# A tibble: 3,142 × 15
   STNAME         CTYNAME        CENSUS2010POP ESTIMATESBASE2010 POPESTIMATE2010
   <chr>          <chr>                  <dbl>             <dbl>           <dbl>
 1 Texas          Loving County             82                82              84
 2 Colorado       San Juan Coun…           699               699             708
 3 North Dakota   McKenzie Coun…          6360              6359            6411
 4 Kentucky       Lee County              7887              7887            7718
 5 North Dakota   Williams Coun…         22398             22399           22588
 6 Texas          Comal County          108472            108485          109270
 7 Texas          Kenedy County            416               413             417
 8 Texas          Kaufman County        103350            103363          103890
 9 North Carolina Brunswick Cou…        107431            107429          108065
10 Florida        Walton County          55043             55043           55211
# ℹ 3,132 more rows
# ℹ 10 more variables: POPESTIMATE2011 <dbl>, POPESTIMATE2012 <dbl>,
#   POPESTIMATE2013 <dbl>, POPESTIMATE2014 <dbl>, POPESTIMATE2015 <dbl>,
#   POPESTIMATE2016 <dbl>, POPESTIMATE2017 <dbl>, POPESTIMATE2018 <dbl>,
#   change <dbl>, location <chr>

EXERCISE: What happens when you sort it in ascending order? Delete the desc part in arrange and see what happens. How would you describe this list?