Any time you are given a dataset from anyone, you should immediately be suspicious. Is this data what I think it is? Does it include what I expect? Is there anything I need to know about it? Will it produce the information I expect?
One of the first things you should do is give it the smell test.
Some common data smells to check for:

- Wrong type of data
- Missing data
- Gaps in data
- Internal inconsistency
- Unusable data, including non-standard abbreviations, ambiguous data, extraneous data and inconsistent data
Not all of these data smells are detectable in code. You may have to ask people about the data. You may have to compare it to another dataset yourself. Does the agency that uses the data produce reports from the data? Does your analysis match those reports? That will expose wrongly derived data, or wrong units, or mistakes you made with inclusion or exclusion.
But several of these data smells we can sniff out in code, first, before we do anything else.
10.1 Wrong Type
First, let’s look at Wrong Type Of Data. We can sniff that out by looking at the output of readr.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.2 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
tickets <- read_csv("data/tickets.csv")
Rows: 161315 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Citation, Location, Violation
dttm (1): Date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Right away, we see there are 104,265 parsing errors. Why? Look closely. The Citation numbers readr sees in the first rows come in as numbers. But 56,000 rows in, those citation numbers start having letters in them, and letters are not numbers.
The cheap way to fix this is to change the guess_max parameter of readr so it uses more than the first few rows to guess the column types. It’ll go a little slower, but it’ll fix the problem.
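A sketch of that fix, assuming the tickets.csv file from above: guess_max = Inf tells readr to look at every row before settling on a column type, though any number past the point where the letters start would also work.

```r
# Re-read the file, letting readr scan every row before guessing
# each column's type. Slower, but no bad guesses.
tickets <- read_csv("data/tickets.csv", guess_max = Inf)
```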
Rows: 161315 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Citation, Location, Violation
dttm (1): Date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
This time, things seem to be good. Date appears to be in a date format, and things that aren’t numbers appear to be text. That’s a good start.
10.2 Missing Data
The second smell we can find in code is missing data. We can do that through a series of Group By and Count steps.
tickets |> group_by(Location) |> tally()
# A tibble: 247 × 2
Location n
<chr> <int>
1 1001 Y Street 508
2 10th & Q Street 303
3 10th & U Street 222
4 1101 Y Street 83
5 11th & Y Street 38
6 1235 Military Road 33
7 1320 Q 1
8 13th & R Lot 4918
9 14th & Avery Lot 1601
10 14th & Avery Parking Garage 2494
# ℹ 237 more rows
What we’re looking for here are blanks: Tickets without a location. Typically, those will appear first or last, depending on several things, so it’s worth checking the front and back of our data.
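Rather than scrolling to the front and back, you can also ask for the blanks directly. A quick sketch, assuming the tickets data frame from above:

```r
# Count tickets where Location is missing. If n comes back 0,
# every ticket has a location.
tickets |>
  filter(is.na(Location)) |>
  tally()
```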
What about ticket type?
tickets |> group_by(Violation) |> tally()
# A tibble: 25 × 2
Violation n
<chr> <int>
1 Damage Property 25
2 Displaying Altered Permit 23280
3 Displaying Counterfeit Permit 18
4 Displaying Stolen Permit 4
5 Expired Meter 45072
6 Failure to Pay[on exit] 251
7 Failure to Reg. Veh to Permit 53
8 Failure to Register Veh w/ UNL 113
9 False Lost/Stolen Permit Rept 927
10 Falsify Permit Application 3
# ℹ 15 more rows
None here either, so that’s good. It means our tickets will always have a location and a violation type.
10.3 Gaps in data
Let’s now look at gaps in data. It’s been my experience that gaps in data often have to do with time, so let’s first look at ticket dates to see if there are any big jumps in the data. You’d expect the numbers to change from month to month, but not by huge amounts. A huge change, more often than not, indicates missing data. Let’s start with Date. If we’re going to work with dates, we should have lubridate handy for floor_date.
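One way to do that, assuming the tickets data frame from above: floor_date rounds each timestamp down to the first of its month, so grouping on it counts tickets per month.

```r
# Count tickets by month. floor_date comes from lubridate,
# which loads with the tidyverse.
tickets |>
  mutate(month = floor_date(Date, "month")) |>
  group_by(month) |>
  tally() |>
  arrange(month)
```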
First thing to notice: our data starts in April 2012. So 2012 isn’t a complete year. Then, scroll through. Look at December 2013 - March 2014. The number of tickets drops to about 10 percent of normal. That’s … odd. And then let’s look at the end – November 2016. So not a complete year in 2016 either.
10.4 Internal inconsistency
Any time you are going to focus on something, you should check it for consistency inside the dataset. So let’s pretend the large number of Displaying Altered Permit tickets caught your attention, and you want to do a story about tens of thousands of students being caught altering their parking permits to cut the cost of parking on campus. Good story, right? Before you go calling the parking office for comment, I’d check that data first.
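The check itself is a filter plus the same month-by-month count as before; a sketch, assuming the tickets data frame and floor_date from above:

```r
# Just the altered-permit tickets, counted by month.
tickets |>
  filter(Violation == "Displaying Altered Permit") |>
  mutate(month = floor_date(Date, "month")) |>
  group_by(month) |>
  tally()
```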
So this charge exists when our data starts, but scroll forward: in October 2013, there are 1,081 tickets written. A month later, only 121. A month after that? One. And then just the occasional ticket for three more years.
Something major changed. What is it? That’s why you are a reporter. Go ask. But we know our data doesn’t support the story we started with.
And that’s what Data Smells are designed to do: stop you from going down a bad path.
10.5 A Shortcut: Summary
One quick way to catch some data smells is R’s built-in summary function, which runs summary statistics on each column of your dataset. Some of the output is … underwhelming … but some is really useful. For example, looking at the min and max of a date column can point to bad data there. Min and max will also show you out-of-range numbers – numbers far too big or too small to make sense.
The syntax is simple.
summary(tickets)
Citation Date Location
Length:161315 Min. :2012-04-02 07:15:00.00 Length:161315
Class :character 1st Qu.:2013-05-06 09:42:30.00 Class :character
Mode :character Median :2014-10-17 12:03:00.00 Mode :character
Mean :2014-08-13 19:36:52.66
3rd Qu.:2015-10-08 07:31:30.00
Max. :2016-11-03 13:59:19.00
Violation
Length:161315
Class :character
Mode :character
In this case, the output doesn’t do much for us except for the dates. Looking at the min and max tells us if we have any out-of-range dates. Here, we do not.