The bane of every data analyst’s existence is data cleaning.
Every developer, every data system, every agency, the all have opinions about how data gets collected. Some decisions make sense from the outside. Some decisions are based entirely on internal poltics: who is creating the data, how they are creating it, why they are creating it. Is it automated? Is it manual? Are data normalized? Are there free form fields where users can just type into or does the system restrict them to choices?
Your question – what you want to do with the data – is almost never part of that equation.
So cleaning data is the process of fixing issues in your data so you can answer the questions you want to answer. Unfortunately, there’s no template here. There’s no checklist. It’s just a big bag of tricks that eventually runs out and you’ll be left fixing individual issues by hand, if it’s really bad.
But let’s start simple. There are certain things that need we can start with that will make our lives easier. We’ll slowly make it harder as we dig deeper.
11.1 Cleaning headers
One of the first places we can start with cleaning data is cleaning the headers. Every system has their own way of recording headers, and every developer has their own thoughts of what a good idea is within it. R is most happy when headers are one word, lower case, without special characters. If you’ve noticed readr output with backticks around headers like Incident Date, it’s because of the space. Headers that start with numbers or are just a number – 2002 – also get backticks in readr.
There is an external library in R called janitor that makes fixing headers trivially simple. You can install it by running install.packages("janitor") in your console.
Load libraries like we always do.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.2 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor)
Attaching package: 'janitor'
The following objects are masked from 'package:stats':
chisq.test, fisher.test
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
dat <- vroom(...)
problems(dat)
Rows: 7674 Columns: 32
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (22): COMMITTED LAST NAME, FIRST NAME...3, MIDDLE NAME...4, NAME EXTENSI...
dbl (6): ID NUMBER, MIN MONTH, MIN DAY, MAX MONTH, MAX DAY, GOOD TIME LAW
lgl (4): NAME EXTENSION...9, GUN CLAUSE, ...31, ...32
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
From the output of readr, you can see all kinds of problems from the get go. Two columns are missing names entirely. Three columns repeat – first name, middle name and name extension. And many field names have spaces or other not-allowed characters. Not to mention: All of them are in ALL CAPS.
Janitor makes this easy to fix. How easy? This easy.
inmates |>clean_names()
# A tibble: 7,674 × 32
id_number committed_last_name first_name_3 middle_name_4 name_extension_5
<dbl> <chr> <chr> <chr> <chr>
1 6145 KANE THOMAS <NA> <NA>
2 20841 ARNOLD WILLIAM L <NA>
3 25324 WALKER RICHARD T <NA>
4 25565 ALVAREZ THOMAS A <NA>
5 26103 ADAMS BRIAN J <NA>
6 26547 KIRBY RONALD EUGENE <NA>
7 27471 NIEMANN MAX A <NA>
8 27666 ORTIZ LAWRENCE <NA> <NA>
9 27767 POINDEXTER EDWARD <NA> <NA>
10 27778 DITTRICH LADDIE F <NA>
# ℹ 7,664 more rows
# ℹ 27 more variables: legal_last_name <chr>, first_name_7 <chr>,
# middle_name_8 <chr>, name_extension_9 <lgl>, date_of_birth <chr>,
# race_desc <chr>, gender <chr>, facility <chr>,
# current_sentence_pardoned_or_commuted_date <chr>, gun_clause <lgl>,
# sentence_begin_date <chr>, min_term_year <chr>, min_month <dbl>,
# min_day <dbl>, max_term_year <chr>, max_month <dbl>, max_day <dbl>, …
Just like that, all lower case, all one word, no backticks necessary to confuse our code later on.
Another thing janitor does well is to make it easy to drop empty columns. Remember the two unnamed columns in the data? Turns out they’re unnamamed because there’s nothing in them. Nada. Blank. We could use select but janitor reduces the labor involved there.
First, let’s see how many columns we have.
inmates |>ncol()
[1] 32
And by using remove_empty("cols"), how many do we get rid of?
One of the most difficult problems to fix in data is duplicates in the data. They can creep in with bad joins, bad data entry practices, mistakes – all kinds of reasons. One trick is determining if a duplicate is indeed a duplicate.
So the question is, do we have any inmates repeated? Anyone in prison twice?
Here we’ll use a function called get_dupes. And we’ll use the inmate’s last name, first name and date of birth. The likelihood that someone has the same name and date of birth is very small, so if there are no duplicates, we should get zero records returned.
Uh oh. We get two Pamela Wallaces born on the same day in 1966. But is it a duplicate record? Look closely. Two different id_numbers. In two different facilities on two different dates. Two different sentencing dates. Is it a duplicate record or the same person entering the system two different times?
11.3 Inconsistency
Janitor also has some handy tools for our data smells. One is called tabyl, which creates a table of unique records in a single field.
So does the Department of Corrections record gender consistently? tabyl will tell us and will tell us a little bit about the data.
clean_headers_inmates |>tabyl(gender)
gender n percent
FEMALE 732 0.09538702
MALE 6942 0.90461298
So the Department of Corrections clearly doesn’t buy into more modern sentiments about gender, but they are at least consistent. Every inmate has a gender – no NAs – and note that 90 percent of inmates are men.
How about race?
clean_headers_inmates |>tabyl(race_desc)
race_desc n percent valid_percent
ASIAN 61 0.0079489184 0.0079520271
BLACK 2037 0.2654417514 0.2655455612
HISPANIC 1059 0.1379984363 0.1380524052
NATIVE AMERICAN 349 0.0454782382 0.0454960240
OTHER 56 0.0072973677 0.0073002216
PACIFIC ISLANDER 7 0.0009121710 0.0009125277
WHITE 4102 0.5345321866 0.5347412332
<NA> 3 0.0003909304 NA
Three people do not have a race – and according to the Census Bureau, Hispanic is not a race, it’s an ethnicity – but otherwise, it looks solid. There’s very little in the way of inconsistency.
How about what facilities they are in?
clean_headers_inmates |>tabyl(facility)
facility n percent valid_percent
COMMUNITY CORRECTIONS-LINCOLN 1276 0.16627574 0.16887242
COMMUNITY CORRECTIONS-OMAHA 368 0.04795413 0.04870302
DIAGNOSTIC & EVALUATION CENTER 778 0.10138129 0.10296453
LINCOLN CORRECTIONAL CENTER 571 0.07440709 0.07556908
NEBRASKA CORR CENTER FOR WOMEN 480 0.06254887 0.06352567
NEBRASKA CORR YOUTH FACILTY 83 0.01081574 0.01098465
NEBRASKA STATE PENITENTIARY 1588 0.20693250 0.21016411
OMAHA CORRECTIONAL CENTER 1059 0.13799844 0.14015352
TECUMSEH STATE COR INSTITUTION 1104 0.14386239 0.14610905
WORK ETHIC CAMP 249 0.03244722 0.03295394
<NA> 118 0.01537660 NA
Not sure how I feel about 118 inmates not having a facility. That’s probably worth investigating. At least one, I know about – it lists the inmate as having escaped in the 1960s and never found. Not sure about the others.
But sometimes, NAs are not bad data. Sometimes they’re just NA. Let’s look at inst_release_type or how inmates were released.
clean_headers_inmates |>tabyl(inst_release_type)
inst_release_type n percent valid_percent
DISCRETIONARY PAROLE 1391 0.1812614021 0.5506730008
ESCAPE 51 0.0066458170 0.0201900238
MANDATORY PAROLE 2 0.0002606203 0.0007917656
RE-PAROLE 3 0.0003909304 0.0011876485
RELEASED TO PRS 1079 0.1406046390 0.4271575614
<NA> 5148 0.6708365911 NA
By far the largest group here is NA. Why is that? They haven’t been released yet. They’re still in prison.