14  Combining and joining

Often, as data journalists, we’re looking at data across time or at data stored in multiple tables. And to do that, we often need to merge that data together.

Depending on what we have, we may just need to stack datasets on top of each other to make a new one. If we have 2019 data and 2018 data and we want that to be one file, we stack them. If we have a dataset of cows by county and a dataset of population by county, we’re going to join those two together on the county – the common element they share.

Let’s explore.

14.1 Combining data

In the last assignment, we harvested data out of PDFs. Let’s reuse what we did there and merge three months of reports from the Department of Revenue together. For mine, I have January 2020, December 2019, and November 2019.

Let’s do what we need to do to import them properly. Unlike the last example, I’ve merged it all into one step for each of the three datasets.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
receiptsJan20 <- read_csv("data/tabula-General_Fund_Receipts_January_2020.csv", skip = 3, col_names = c("Month", "TotalActualNetReceipts", "TotalProjectedNetReceipts", "Difference", "PercentDifference", "CumulativeActualNetReceipts", "CumulativeProjectedNetReceipts", "CumulativeDifference","CumulativePercentDifference"), skip_empty_rows = TRUE) |> mutate(
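  # gsub() strips the commas, dollar signs and percent signs so these columns can be converted to numbers below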
  TotalActualNetReceipts = gsub(",","",TotalActualNetReceipts),
  TotalActualNetReceipts = gsub("\\$","",TotalActualNetReceipts),
  TotalProjectedNetReceipts = gsub(",","",TotalProjectedNetReceipts),
  TotalProjectedNetReceipts = gsub("\\$","",TotalProjectedNetReceipts),
  Difference = gsub(",","",Difference),
  Difference = gsub("\\$","",Difference),
  PercentDifference = gsub("\\%","",PercentDifference),
  CumulativeActualNetReceipts = gsub(",","",CumulativeActualNetReceipts),
  CumulativeActualNetReceipts = gsub("\\$","",CumulativeActualNetReceipts),
  CumulativeProjectedNetReceipts = gsub(",","",CumulativeProjectedNetReceipts),
  CumulativeProjectedNetReceipts = gsub("\\$","",CumulativeProjectedNetReceipts),
  CumulativeDifference = gsub(",","",CumulativeDifference),
  CumulativeDifference = gsub("\\$","",CumulativeDifference),
  CumulativePercentDifference = gsub("\\%","",CumulativePercentDifference)
  ) |> mutate_at(vars(-Month), as.numeric) |> mutate(ReportMonth = "January 2020")
Rows: 7 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): Month, TotalActualNetReceipts, TotalProjectedNetReceipts, Differenc...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
receiptsDec19 <- read_csv("data/tabula-General_Fund_Receipts_December_2019.csv", skip = 3, col_names = c("Month", "TotalActualNetReceipts", "TotalProjectedNetReceipts", "Difference", "PercentDifference", "CumulativeActualNetReceipts", "CumulativeProjectedNetReceipts", "CumulativeDifference","CumulativePercentDifference"), skip_empty_rows = TRUE) |> mutate(
  TotalActualNetReceipts = gsub(",","",TotalActualNetReceipts),
  TotalActualNetReceipts = gsub("\\$","",TotalActualNetReceipts),
  TotalProjectedNetReceipts = gsub(",","",TotalProjectedNetReceipts),
  TotalProjectedNetReceipts = gsub("\\$","",TotalProjectedNetReceipts),
  Difference = gsub(",","",Difference),
  Difference = gsub("\\$","",Difference),
  PercentDifference = gsub("\\%","",PercentDifference),
  CumulativeActualNetReceipts = gsub(",","",CumulativeActualNetReceipts),
  CumulativeActualNetReceipts = gsub("\\$","",CumulativeActualNetReceipts),
  CumulativeProjectedNetReceipts = gsub(",","",CumulativeProjectedNetReceipts),
  CumulativeProjectedNetReceipts = gsub("\\$","",CumulativeProjectedNetReceipts),
  CumulativeDifference = gsub(",","",CumulativeDifference),
  CumulativeDifference = gsub("\\$","",CumulativeDifference),
  CumulativePercentDifference = gsub("\\%","",CumulativePercentDifference)
  ) |> mutate_at(vars(-Month), as.numeric) |> mutate(ReportMonth = "December 2019")
Rows: 6 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): Month, TotalActualNetReceipts, TotalProjectedNetReceipts, Differenc...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
receiptsNov19 <- read_csv("data/tabula-General_Fund_Receipts_November_12-13-2019.csv", skip = 3, col_names = c("Month", "TotalActualNetReceipts", "TotalProjectedNetReceipts", "Difference", "PercentDifference", "CumulativeActualNetReceipts", "CumulativeProjectedNetReceipts", "CumulativeDifference","CumulativePercentDifference"), skip_empty_rows = TRUE) |> mutate(
  TotalActualNetReceipts = gsub(",","",TotalActualNetReceipts),
  TotalActualNetReceipts = gsub("\\$","",TotalActualNetReceipts),
  TotalProjectedNetReceipts = gsub(",","",TotalProjectedNetReceipts),
  TotalProjectedNetReceipts = gsub("\\$","",TotalProjectedNetReceipts),
  Difference = gsub(",","",Difference),
  Difference = gsub("\\$","",Difference),
  PercentDifference = gsub("\\%","",PercentDifference),
  CumulativeActualNetReceipts = gsub(",","",CumulativeActualNetReceipts),
  CumulativeActualNetReceipts = gsub("\\$","",CumulativeActualNetReceipts),
  CumulativeProjectedNetReceipts = gsub(",","",CumulativeProjectedNetReceipts),
  CumulativeProjectedNetReceipts = gsub("\\$","",CumulativeProjectedNetReceipts),
  CumulativeDifference = gsub(",","",CumulativeDifference),
  CumulativeDifference = gsub("\\$","",CumulativeDifference),
  CumulativePercentDifference = gsub("\\%","",CumulativePercentDifference)
  ) |> mutate_at(vars(-Month), as.numeric) |> mutate(ReportMonth = "November 2019")
Rows: 5 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): Month, TotalActualNetReceipts, TotalProjectedNetReceipts, Differenc...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
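
One more note before we stack these: the cleanup is identical for all three files, so if you wanted to avoid copying and pasting it three times, you could wrap it in a function. Here’s a rough sketch – clean_receipts() is a name I made up, and it assumes every report has the same layout as the ones above:

# a made-up helper that reads one report and cleans it the same way as the imports above
clean_receipts <- function(path, report_month) {
  read_csv(path, skip = 3, col_names = c("Month", "TotalActualNetReceipts", "TotalProjectedNetReceipts", "Difference", "PercentDifference", "CumulativeActualNetReceipts", "CumulativeProjectedNetReceipts", "CumulativeDifference", "CumulativePercentDifference"), skip_empty_rows = TRUE) |>
    # strip dollar signs, commas and percent signs from every column except Month, then convert to numbers
    mutate(across(-Month, ~ as.numeric(gsub("[$,%]", "", .x)))) |>
    mutate(ReportMonth = report_month)
}

receiptsJan20 <- clean_receipts("data/tabula-General_Fund_Receipts_January_2020.csv", "January 2020")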

All three of these datasets have the same number of columns, all with the same names, so if we want to compare them over time, we can stack them on top of each other. The verb here, in R, is rbind. The good news about rbind is that it is very simple. We’re going to stack our dataframes two at a time, so with three of them it takes two steps.

First, we’ll create a new dataframe populated with two of our revenue dataframes rbinded together. Then, we’ll rbind that result with the third revenue dataframe.

predictions1 <- rbind(receiptsJan20, receiptsDec19)

predictions2 <- rbind(predictions1, receiptsNov19)

predictions2
# A tibble: 18 × 10
   Month     TotalActualNetReceipts TotalProjectedNetReceipts Difference
   <chr>                      <dbl>                     <dbl>      <dbl>
 1 July                   284883132                 271473079   13410054
 2 August                 462019974                 440504016   21515958
 3 September              551908013                 510286143   41621870
 4 October                289723434                 266204529   23518905
 5 November               431787603                 404934524   26853079
 6 December               472926836                 421455999   51470837
 7 January                467698460                 412661036   55037424
 8 July                   284883132                 271473079   13410054
 9 August                 462019974                 440504016   21515958
10 September              551908013                 510286143   41621870
11 October                289723434                 266204529   23518905
12 November               431787603                 404934524   26853079
13 December               472926836                 421455999   51470837
14 July                   284883132                 271473079   13410054
15 August                 462019974                 440504016   21515958
16 September              551908013                 510286143   41621870
17 October                289723434                 266204529   23518905
18 November               431787603                 404934524   26853079
# ℹ 6 more variables: PercentDifference <dbl>,
#   CumulativeActualNetReceipts <dbl>, CumulativeProjectedNetReceipts <dbl>,
#   CumulativeDifference <dbl>, CumulativePercentDifference <dbl>,
#   ReportMonth <chr>

And boom, like that, we have 18 rows of data instead of three separate dataframes of 7, 6 and 5 rows.
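
One side note: rbind isn’t the only way to do this. dplyr’s bind_rows() will stack any number of dataframes at once, so the two steps above could be a single step. A quick sketch, assuming the three dataframes we imported above:

# bind_rows can take all three dataframes at once
predictionsAll <- bind_rows(receiptsJan20, receiptsDec19, receiptsNov19)

# same 18 rows as the two-step rbind version
nrow(predictionsAll)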

14.2 Joining data

More difficult is when you have two separate tables that are connected by a common element or elements.

Let’s return to our fatal accident data. In reality, the Fatality Analysis Reporting System data has 27 tables in it – everything from details about the damage to the paperwork done.

Let’s merge just two of them, and just for the state of Nebraska – download the accidents and the people.

Often, with relational data files like this, there’s a substantial amount of documentation that goes with them to tell you how the tables are related and what the codes mean. The FARS data is no different. You should open it and click on the PERSON Data File.

ST_CASE should be used to merge the Person data file with the Accident data file for a set of all motorists and non-motorists.

So that’s what we’re going to do.

accidents <- read_csv("data/neaccidents.csv")
Rows: 201 Columns: 52
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): TWAY_ID, TWAY_ID2, RAIL
dbl (49): STATE, ST_CASE, VE_TOTAL, VE_FORMS, PVH_INVL, PEDS, PERNOTMVIT, PE...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
persons <- read_csv("data/nepersons.csv")
Rows: 553 Columns: 62
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (62): STATE, ST_CASE, VE_FORMS, VEH_NO, PER_NO, STR_VEH, COUNTY, DAY, MO...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

First, notice something in the environment about your dataset: there are 201 accidents but 553 persons. That means there’s not quite 3 people involved in every accident on average between drivers and passengers. Some are single car, single person crashes. Some involve a lot of people.
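
If you want to check that math yourself, divide the row counts of the two dataframes we just loaded:

# 553 people divided by 201 accidents is about 2.75 people per accident
nrow(persons) / nrow(accidents)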

To put these two tables together, we need to use something called a join. There are different kinds of joins. It’s better if you think of two tables sitting next to each other. A left_join takes all the records from the left table and only the records that match in the right one. A right_join does the same thing, just flipped – everything from the right table and only the matches from the left. An inner_join keeps only the records that match in both tables. There’s one other join – a full_join, which returns all rows of both regardless of whether there’s a match – but I’ve never once had a use for a full join.
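
Before we try this on the crash data, here’s a toy example with made-up numbers – a small table of cows by county and a small table of county populations, like the example at the top of this chapter – so you can see what each join keeps:

# two tiny made-up tables that share a county column
cowcounts <- tibble(
  county = c("Adams", "Banner", "Cass"),
  cows = c(12000, 8000, 3000)
)

popcounts <- tibble(
  county = c("Adams", "Cass", "Douglas"),
  population = c(31000, 26000, 571000)
)

cowcounts |> left_join(popcounts, by = "county")   # all three cow counties; Banner gets NA for population
cowcounts |> right_join(popcounts, by = "county")  # all three population counties; Douglas gets NA for cows
cowcounts |> inner_join(popcounts, by = "county")  # only Adams and Cass, the counties in both tables
cowcounts |> full_join(popcounts, by = "county")   # all four counties that appear in either table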

In the PERSON Data File documentation, we see that the column connecting these two tables is ST_CASE.

So we can do this join multiple ways and get a similar result. We can put the person file on the left and the accident file on the right and use a left join to get them all together. We use by= to tell it which column to join on. And to avoid rendering hundreds of rows of data, I’m going to count the rows at the end. The reason I’m doing this is important: Rule 1 in joining data is having an idea of what you are expecting to get. With a left join and people on the left, I have 553 people, so I expect to get 553 rows when I’m done.

persons |> left_join(accidents, by="ST_CASE") |> nrow()
[1] 553

Remove the nrow and run it again for yourself. See how there are several columns that end with .x or .y? Those are duplicates – the same column name appeared in both tables, so dplyr kept both copies and added a suffix. There’s a solid chance they contain the same values in both tables. By default, dplyr will do a “natural” join, where it matches on all the columns the two tables have in common. So if we take out the by, it’ll use all the common columns between the tables. That may not be right – our documentation says ST_CASE is how they are related – but let’s try it. If it works, we should get 553 rows.

persons |> left_join(accidents) 
Joining with `by = join_by(STATE, ST_CASE, VE_FORMS, COUNTY, DAY, MONTH, HOUR,
MINUTE, RUR_URB, FUNC_SYS, HARM_EV, MAN_COLL, SCH_BUS)`
# A tibble: 553 × 101
   STATE ST_CASE VE_FORMS VEH_NO PER_NO STR_VEH COUNTY   DAY MONTH  HOUR MINUTE
   <dbl>   <dbl>    <dbl>  <dbl>  <dbl>   <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl>
 1    31  310001        2      1      1       0     55     1     1     0     20
 2    31  310001        2      2      1       0     55     1     1     0     20
 3    31  310002        1      1      1       0    157     7     1    22      0
 4    31  310003        1      0      1       1     55    10     1     6     57
 5    31  310003        1      1      1       0     55    10     1     6     57
 6    31  310004        2      1      1       0     79    10     1    16     55
 7    31  310004        2      2      1       0     79    10     1    16     55
 8    31  310005        2      1      1       0    119    10     1     4     30
 9    31  310005        2      2      1       0    119    10     1     4     30
10    31  310006        1      0      1       1    109    11     1     6     25
# ℹ 543 more rows
# ℹ 90 more variables: RUR_URB <dbl>, FUNC_SYS <dbl>, HARM_EV <dbl>,
#   MAN_COLL <dbl>, SCH_BUS <dbl>, MAKE <dbl>, MAK_MOD <dbl>, BODY_TYP <dbl>,
#   MOD_YEAR <dbl>, TOW_VEH <dbl>, SPEC_USE <dbl>, EMER_USE <dbl>,
#   ROLLOVER <dbl>, IMPACT1 <dbl>, FIRE_EXP <dbl>, AGE <dbl>, SEX <dbl>,
#   PER_TYP <dbl>, INJ_SEV <dbl>, SEAT_POS <dbl>, REST_USE <dbl>,
#   REST_MIS <dbl>, AIR_BAG <dbl>, EJECTION <dbl>, EJ_PATH <dbl>, …

So instead of just one column, it used 13. And we got the same answer. And we don’t have any columns ending in .x or .y anymore. So we’re good to move forward.

Let’s save our joined data to a new dataframe.

personaccidents <- persons |> left_join(accidents)
Joining with `by = join_by(STATE, ST_CASE, VE_FORMS, COUNTY, DAY, MONTH, HOUR,
MINUTE, RUR_URB, FUNC_SYS, HARM_EV, MAN_COLL, SCH_BUS)`

Now, with our joined data, we can answer questions that come from both datasets. So what if we looked at the median age of drivers who died, broken down by what kind of roadway the accident happened on? We can do this now because the roadway information comes from the accident table, while the age, who was driving and what type of injury they sustained come from the person table.

We get this by using filters followed by a group_by and summarize. In the data documentation linked above, look in the PERSON Data File to get the appropriate filters. In this case, we want a PER_TYP of 1 (the driver) and an INJ_SEV of 4, which means death. In the ACCIDENT Data File section, we learn it’s ROUTE we want to group by.

personaccidents |> 
  filter(PER_TYP == 1) |> 
  filter(INJ_SEV == 4) |> 
  group_by(ROUTE) |> 
  summarize(
    count = n(), 
    avgage = mean(AGE), 
    medage = median(AGE))
# A tibble: 5 × 4
  ROUTE count avgage medage
  <dbl> <int>  <dbl>  <dbl>
1     1    15   40.5     33
2     2    51   53.1     51
3     3    31   42.3     42
4     4    37   37.3     36
5     6    18   40.4     35

According to our query, 15 drivers died in crashes on interstates, and their median age – 33 – was the lowest of all. The most driver deaths were on US Highways, which makes sense because there are a lot more lane miles of US Highways than Interstates in Nebraska and pretty much every other state. But the second most common is county roads. And the median age of drivers who died there was quite low at 36.

Let’s break this down one more step. What if we added RUR_URB – whether the accident happened in a rural or an urban place? A common feeling in a rural state like Nebraska is that the urban interstate is a ribbon of insanity. But is it?

personaccidents |> 
  filter(PER_TYP == 1) |> 
  filter(INJ_SEV == 4) |> 
  group_by(RUR_URB, ROUTE) |> 
  summarize(
    count = n(), 
    avgage = mean(AGE), 
    medage = median(AGE))
`summarise()` has grouped output by 'RUR_URB'. You can override using the
`.groups` argument.
# A tibble: 8 × 5
# Groups:   RUR_URB [2]
  RUR_URB ROUTE count avgage medage
    <dbl> <dbl> <int>  <dbl>  <dbl>
1       1     1    12   45.3   42.5
2       1     2    41   52.5   51  
3       1     3    27   41.9   42  
4       1     4    37   37.3   36  
5       2     1     3   21.3   21  
6       2     2    10   55.3   54.5
7       2     3     4   45     35  
8       2     6    18   40.4   35  

In 2018, only 3 of the 15 driver deaths on interstates were in urban areas. All the rest were in rural areas. And all 37 drivers who died in accidents on county roads were in rural areas. Most of the driver deaths on US Highways were in rural places as well.