11 Data Cleaning Part II: Janitor

The bane of every data analyst’s existence is data cleaning.

Every developer, every data system, every agency, the all have opinions about how data gets collected. Some decisions make sense from the outside. Some decisions are based entirely on internal poltics: who is creating the data, how they are creating it, why they are creating it. Is it automated? Is it manual? Are data normalized? Are there free form fields where users can just type into or does the system restrict them to choices?

Your question – what you want to do with the data – is almost never part of that equation.

So cleaning data is the process of fixing issues in your data so you can answer the questions you want to answer. Unfortunately, there’s no template here. There’s no checklist. It’s just a big bag of tricks that eventually runs out and you’ll be left fixing individual issues by hand, if it’s really bad.

But let’s start simple. There are certain things that need we can start with that will make our lives easier. We’ll slowly make it harder as we dig deeper.

11.1 Cleaning headers

One of the first places we can start with cleaning data is cleaning the headers. Every system has their own way of recording headers, and every developer has their own thoughts of what a good idea is within it. R is most happy when headers are one word, lower case, without special characters. If you’ve noticed readr output with backticks around headers like Incident Date, it’s because of the space. Headers that start with numbers or are just a number – 2002 – also get backticks in readr.

There is an external library in R called janitor that makes fixing headers trivially simple. You can install it by running install.packages("janitor") in your console.

Load libraries like we always do.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(janitor)


Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test

Let’s load a new dataset – the list of active inmates in the Nebraska prison system.

inmates <- read_csv("data/activeinmates.csv")

New names:
• `FIRST NAME` -> `FIRST NAME...3`
• `MIDDLE NAME` -> `MIDDLE NAME...4`
• `NAME EXTENSION` -> `NAME EXTENSION...5`
• `FIRST NAME` -> `FIRST NAME...7`
• `MIDDLE NAME` -> `MIDDLE NAME...8`
• `NAME EXTENSION` -> `NAME EXTENSION...9`
• `` -> `...31`
• `` -> `...32`

Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)

Rows: 7674 Columns: 32
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (22): COMMITTED LAST NAME, FIRST NAME...3, MIDDLE NAME...4, NAME EXTENSI...
dbl  (6): ID NUMBER, MIN MONTH, MIN DAY, MAX MONTH, MAX DAY, GOOD TIME LAW
lgl  (4): NAME EXTENSION...9, GUN CLAUSE, ...31, ...32

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

From the output of readr, you can see all kinds of problems from the get go. Two columns are missing names entirely. Three columns repeat – first name, middle name and name extension. And many field names have spaces or other not-allowed characters. Not to mention: All of them are in ALL CAPS.

Janitor makes this easy to fix. How easy? This easy.

inmates |> clean_names()

# A tibble: 7,674 × 32
   id_number committed_last_name first_name_3 middle_name_4 name_extension_5
       <dbl> <chr>               <chr>        <chr>         <chr>           
 1      6145 KANE                THOMAS       <NA>          <NA>            
 2     20841 ARNOLD              WILLIAM      L             <NA>            
 3     25324 WALKER              RICHARD      T             <NA>            
 4     25565 ALVAREZ             THOMAS       A             <NA>            
 5     26103 ADAMS               BRIAN        J             <NA>            
 6     26547 KIRBY               RONALD       EUGENE        <NA>            
 7     27471 NIEMANN             MAX          A             <NA>            
 8     27666 ORTIZ               LAWRENCE     <NA>          <NA>            
 9     27767 POINDEXTER          EDWARD       <NA>          <NA>            
10     27778 DITTRICH            LADDIE       F             <NA>            
# ℹ 7,664 more rows
# ℹ 27 more variables: legal_last_name <chr>, first_name_7 <chr>,
#   middle_name_8 <chr>, name_extension_9 <lgl>, date_of_birth <chr>,
#   race_desc <chr>, gender <chr>, facility <chr>,
#   current_sentence_pardoned_or_commuted_date <chr>, gun_clause <lgl>,
#   sentence_begin_date <chr>, min_term_year <chr>, min_month <dbl>,
#   min_day <dbl>, max_term_year <chr>, max_month <dbl>, max_day <dbl>, …

Just like that, all lower case, all one word, no backticks necessary to confuse our code later on.

Another thing janitor does well is to make it easy to drop empty columns. Remember the two unnamed columns in the data? Turns out they’re unnamamed because there’s nothing in them. Nada. Blank. We could use select but janitor reduces the labor involved there.

First, let’s see how many columns we have.

inmates |> ncol()

[1] 32

And by using remove_empty("cols"), how many do we get rid of?

inmates |> clean_names() |> remove_empty("cols") |> ncol()

[1] 28

So this tells us there’s three completely empty columns in our data. So why keep them around.

So we can run all of this together and get a dataset with useful columns and clean headers.

inmates |> clean_names() |> remove_empty("cols") -> clean_headers_inmates

11.2 Duplicates

One of the most difficult problems to fix in data is duplicates in the data. They can creep in with bad joins, bad data entry practices, mistakes – all kinds of reasons. One trick is determining if a duplicate is indeed a duplicate.

So the question is, do we have any inmates repeated? Anyone in prison twice?

Here we’ll use a function called get_dupes. And we’ll use the inmate’s last name, first name and date of birth. The likelihood that someone has the same name and date of birth is very small, so if there are no duplicates, we should get zero records returned.

clean_headers_inmates |> get_dupes(committed_last_name, first_name_3, date_of_birth)

# A tibble: 2 × 29
  committed_last_name first_name_3 date_of_birth dupe_count id_number
  <chr>               <chr>        <chr>              <int>     <dbl>
1 WALLACE             PAMELA       7/11/1966              2     99240
2 WALLACE             PAMELA       7/11/1966              2     99955
# ℹ 24 more variables: middle_name_4 <chr>, name_extension_5 <chr>,
#   legal_last_name <chr>, first_name_7 <chr>, middle_name_8 <chr>,
#   race_desc <chr>, gender <chr>, facility <chr>,
#   current_sentence_pardoned_or_commuted_date <chr>,
#   sentence_begin_date <chr>, min_term_year <chr>, min_month <dbl>,
#   min_day <dbl>, max_term_year <chr>, max_month <dbl>, max_day <dbl>,
#   parole_eligibility_date <chr>, earliest_possible_release_date <chr>, …

Uh oh. We get two Pamela Wallaces born on the same day in 1966. But is it a duplicate record? Look closely. Two different id_numbers. In two different facilities on two different dates. Two different sentencing dates. Is it a duplicate record or the same person entering the system two different times?

11.3 Inconsistency

Janitor also has some handy tools for our data smells. One is called tabyl, which creates a table of unique records in a single field.

So does the Department of Corrections record gender consistently? tabyl will tell us and will tell us a little bit about the data.

clean_headers_inmates |> tabyl(gender)

 gender    n    percent
 FEMALE  732 0.09538702
   MALE 6942 0.90461298

So the Department of Corrections clearly doesn’t buy into more modern sentiments about gender, but they are at least consistent. Every inmate has a gender – no NAs – and note that 90 percent of inmates are men.

How about race?

clean_headers_inmates |> tabyl(race_desc)

        race_desc    n      percent valid_percent
            ASIAN   61 0.0079489184  0.0079520271
            BLACK 2037 0.2654417514  0.2655455612
         HISPANIC 1059 0.1379984363  0.1380524052
  NATIVE AMERICAN  349 0.0454782382  0.0454960240
            OTHER   56 0.0072973677  0.0073002216
 PACIFIC ISLANDER    7 0.0009121710  0.0009125277
            WHITE 4102 0.5345321866  0.5347412332
             <NA>    3 0.0003909304            NA

Three people do not have a race – and according to the Census Bureau, Hispanic is not a race, it’s an ethnicity – but otherwise, it looks solid. There’s very little in the way of inconsistency.

How about what facilities they are in?

clean_headers_inmates |> tabyl(facility)

                       facility    n    percent valid_percent
  COMMUNITY CORRECTIONS-LINCOLN 1276 0.16627574    0.16887242
    COMMUNITY CORRECTIONS-OMAHA  368 0.04795413    0.04870302
 DIAGNOSTIC & EVALUATION CENTER  778 0.10138129    0.10296453
    LINCOLN CORRECTIONAL CENTER  571 0.07440709    0.07556908
 NEBRASKA CORR CENTER FOR WOMEN  480 0.06254887    0.06352567
    NEBRASKA CORR YOUTH FACILTY   83 0.01081574    0.01098465
    NEBRASKA STATE PENITENTIARY 1588 0.20693250    0.21016411
      OMAHA CORRECTIONAL CENTER 1059 0.13799844    0.14015352
 TECUMSEH STATE COR INSTITUTION 1104 0.14386239    0.14610905
                WORK ETHIC CAMP  249 0.03244722    0.03295394
                           <NA>  118 0.01537660            NA

Not sure how I feel about 118 inmates not having a facility. That’s probably worth investigating. At least one, I know about – it lists the inmate as having escaped in the 1960s and never found. Not sure about the others.

But sometimes, NAs are not bad data. Sometimes they’re just NA. Let’s look at inst_release_type or how inmates were released.

clean_headers_inmates |> tabyl(inst_release_type)

    inst_release_type    n      percent valid_percent
 DISCRETIONARY PAROLE 1391 0.1812614021  0.5506730008
               ESCAPE   51 0.0066458170  0.0201900238
     MANDATORY PAROLE    2 0.0002606203  0.0007917656
            RE-PAROLE    3 0.0003909304  0.0011876485
      RELEASED TO PRS 1079 0.1406046390  0.4271575614
                 <NA> 5148 0.6708365911            NA

By far the largest group here is NA. Why is that? They haven’t been released yet. They’re still in prison.