Any time you are given a dataset from anyone, you should immediately be suspicious. Is this data what I think it is? Does it include what I expect? Is there anything I need to know about it? Will it produce the information I expect?
One of the first things you should do is give it the smell test.
Some common data smells to check for:

- Wrong type of data
- Missing data
- Gaps in data
- Internal inconsistency
- Unusable data, including non-standard abbreviations, ambiguous data, extraneous data and inconsistent data
Not all of these data smells are detectable in code. You may have to ask people about the data. You may have to compare it to another dataset yourself. Does the agency that uses the data produce reports from the data? Does your analysis match those reports? That will expose wrongly derived data, or wrong units, or mistakes you made with inclusion or exclusion.
But several of these data smells we can sniff out in code, first, before we do anything else.
10.1 Wrong Type
First, let’s look at Wrong Type Of Data. We can sniff that out by looking at the output of readr.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.2 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
tickets <- read_csv("data/tickets.csv")
Rows: 161315 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Citation, Location, Violation
dttm (1): Date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Right away, we see there are 104,265 parsing errors. Why? Look closely. The Citation numbers readr sees in the first rows come in as numbers. But 56,000 rows in, those citation numbers start having letters in them, and letters are not numbers.
The cheap way to fix this is to change the guess_max parameter of readr so it uses more than the first few rows to guess the column types. It’ll go a little slower, but it’ll fix the problem.
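A sketch of that fix, assuming the tickets.csv file from above: guess_max = Inf tells readr to look at every row before settling on a column type, though any number past the point where the letters start would also work.

```r
# Re-read the file, letting readr scan every row before guessing
# each column's type. Slower, but no bad guesses.
tickets <- read_csv("data/tickets.csv", guess_max = Inf)
```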
Rows: 161315 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Citation, Location, Violation
dttm (1): Date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
This time, things seem to be good. Date appears to be in a date format, and things that aren’t numbers appear to be text. That’s a good start.
10.2 Missing Data
The second smell we can find in code is missing data. We can do that through a series of Group By and Count steps.
tickets |> group_by(Location) |> tally()
# A tibble: 247 × 2
Location n
<chr> <int>
1 1001 Y Street 508
2 10th & Q Street 303
3 10th & U Street 222
4 1101 Y Street 83
5 11th & Y Street 38
6 1235 Military Road 33
7 1320 Q 1
8 13th & R Lot 4918
9 14th & Avery Lot 1601
10 14th & Avery Parking Garage 2494
# ℹ 237 more rows
What we’re looking for here are blanks: Tickets without a location. Typically, those will appear first or last, depending on several things, so it’s worth checking the front and back of our data.
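Rather than scrolling to the front and back, you can also ask for the blanks directly. A quick sketch, assuming the tickets data frame from above:

```r
# Count tickets where Location is missing. If n comes back 0,
# every ticket has a location.
tickets |>
  filter(is.na(Location)) |>
  tally()
```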
What about ticket type?
tickets |> group_by(Violation) |> tally()
# A tibble: 25 × 2
Violation n
<chr> <int>
1 Damage Property 25
2 Displaying Altered Permit 23280
3 Displaying Counterfeit Permit 18
4 Displaying Stolen Permit 4
5 Expired Meter 45072
6 Failure to Pay[on exit] 251
7 Failure to Reg. Veh to Permit 53
8 Failure to Register Veh w/ UNL 113
9 False Lost/Stolen Permit Rept 927
10 Falsify Permit Application 3
# ℹ 15 more rows
None here either, so that’s good. It means our tickets will always have a location and a violation type.
10.3 Gaps in data
Let’s now look at gaps in data. It’s been my experience that gaps in data often have to do with time, so let’s first look at ticket dates to see if there are any big jumps in the data. You’d expect the numbers to change from month to month, but not by huge amounts. A huge change, more often than not, indicates missing data. Let’s start with Date. If we’re going to work with dates, we should have lubridate handy for floor_date.
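One way to do that, assuming the tickets data frame from above: floor_date rounds each timestamp down to the first of its month, so grouping on it counts tickets per month.

```r
# Count tickets by month. floor_date comes from lubridate,
# which loads with the tidyverse.
tickets |>
  mutate(month = floor_date(Date, "month")) |>
  group_by(month) |>
  tally() |>
  arrange(month)
```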
First thing to notice: our data starts in April 2012. So 2012 isn’t a complete year. Then, scroll through. Look at December 2013 - March 2014. The number of tickets drops to about 10 percent of normal. That’s … odd. And then let’s look at the end – November 2016. So not a complete year in 2016 either.
10.4 Internal inconsistency
Any time you are going to focus on something, you should check it for consistency inside the dataset. So let’s pretend the large number of Displaying Altered Permit tickets caught your attention, and you want to do a story about tens of thousands of students being caught altering their parking permits to cut the cost of parking on campus. Good story, right? Before you go calling the parking office for comment, I’d check that data first.
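The check itself is a filter plus the same month-by-month count as before; a sketch, assuming the tickets data frame and floor_date from above:

```r
# Just the altered-permit tickets, counted by month.
tickets |>
  filter(Violation == "Displaying Altered Permit") |>
  mutate(month = floor_date(Date, "month")) |>
  group_by(month) |>
  tally()
```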
So this charge exists when our data starts, but scroll forward: in October 2013, there are 1,081 tickets written. A month later, only 121. A month after that? One. And then just the occasional ticket for three more years.
Something major changed. What is it? That’s why you are a reporter. Go ask. But we know our data doesn’t support the story we started with.
And that’s what Data Smells are designed to do: stop you from going down a bad path.
10.5 A Shortcut: Summary
One quick way to catch some data smells is R’s built-in summary function, which runs summary statistics on each column of your dataset. Some of the output is … underwhelming … but some is really useful. For example, looking at the min and max of a date column can point to bad data there. Min and max will also show you out-of-range numbers – numbers far too big or too small to make sense.
The syntax is simple.
summary(tickets)
Citation Date Location
Length:161315 Min. :2012-04-02 07:15:00.00 Length:161315
Class :character 1st Qu.:2013-05-06 09:42:30.00 Class :character
Mode :character Median :2014-10-17 12:03:00.00 Mode :character
Mean :2014-08-13 19:36:52.66
3rd Qu.:2015-10-08 07:31:30.00
Max. :2016-11-03 13:59:19.00
Violation
Length:161315
Class :character
Mode :character
In this case, the output doesn’t do much for us except for the dates. Looking at the min and max tells us if we have any out-of-range dates. Here, we do not.