8 Working with dates

One of the most frustrating things in data is working with dates. Everyone has a different opinion on how to record them, and every software package on the planet has to sort it out. Dealing with it can be a little … confusing. And every dataset has something new to throw at you. So consider this an introduction.

We’re going to do this two ways. First I’m going to show you how to use base R to solve a tricky problem. And then we’ll use a library called lubridate to solve a more common and less tricky problem. And then we’ll use a new library to solve most of the common problems before they start.

8.1 The hard way

First, we’ll import tidyverse like we always do.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

We’re going to use a dataset of parking tickets at UNL. If we do this the old way – using read.csv – this is what we get:

tickets <- read.csv("data/tickets.csv")
head(tickets)

  Citation                Date        Location                    Violation
1 15078429 2012-04-02 07:15:00   North Stadium                Expired Meter
2 24048318 2012-04-02 07:22:00         Housing    No Valid Permit Displayed
3 24048320 2012-04-02 07:26:00 14th & W Street    No Valid Permit Displayed
4 15078430 2012-04-02 07:36:00  Champions Club Parking in Unauthorized Area
5 18074937 2012-04-02 07:39:00          Sandoz                Expired Meter
6 18074938 2012-04-02 07:40:00          Sandoz                Expired Meter

Note the date is a factor, not a date. We have to fix that. There’s a lot of ways to fix dates. The base R way is to use formatting. The code is … a little odd … but it’s useful to know if you get a really odd date format. What you are doing is essentially parsing the date into it’s component parts then reassmbling it into a date using formatting.

newtickets <- tickets |> mutate(
    CleanDate = as.POSIXct(Date, format="%Y-%m-%d %H:%M:%S")
)

head(newtickets)

  Citation                Date        Location                    Violation
1 15078429 2012-04-02 07:15:00   North Stadium                Expired Meter
2 24048318 2012-04-02 07:22:00         Housing    No Valid Permit Displayed
3 24048320 2012-04-02 07:26:00 14th & W Street    No Valid Permit Displayed
4 15078430 2012-04-02 07:36:00  Champions Club Parking in Unauthorized Area
5 18074937 2012-04-02 07:39:00          Sandoz                Expired Meter
6 18074938 2012-04-02 07:40:00          Sandoz                Expired Meter
            CleanDate
1 2012-04-02 07:15:00
2 2012-04-02 07:22:00
3 2012-04-02 07:26:00
4 2012-04-02 07:36:00
5 2012-04-02 07:39:00
6 2012-04-02 07:40:00

CleanDate is now a special date format that includes times.

You can almost read the code that created it: The format of the date is %Y, which means a four digit year DASH %m or two digit month DASH %d or two digit day SPACE %H or two digit hour COLON %M or two digit minute COLON %S or two digit second. You can remix that as you need. If you had a date that was 20021212 then you would do format="%Y%m%d" and so on.

There is a library called lubridate that can parse some common date problems. If it’s not already installed, just run install.packages('lubridate')

library(lubridate)

Lubridate can handle this tickets data easier with one of it’s many functions. The functions parse dates given a basic pattern. In this case, our data is in a very common pattern of year month date hours minutes seconds. Lubridate has a function called ymd_hms.

lubridatetickets <- tickets |> mutate(
    CleanDate = ymd_hms(Date)
)

head(lubridatetickets)

  Citation                Date        Location                    Violation
1 15078429 2012-04-02 07:15:00   North Stadium                Expired Meter
2 24048318 2012-04-02 07:22:00         Housing    No Valid Permit Displayed
3 24048320 2012-04-02 07:26:00 14th & W Street    No Valid Permit Displayed
4 15078430 2012-04-02 07:36:00  Champions Club Parking in Unauthorized Area
5 18074937 2012-04-02 07:39:00          Sandoz                Expired Meter
6 18074938 2012-04-02 07:40:00          Sandoz                Expired Meter
            CleanDate
1 2012-04-02 07:15:00
2 2012-04-02 07:22:00
3 2012-04-02 07:26:00
4 2012-04-02 07:36:00
5 2012-04-02 07:39:00
6 2012-04-02 07:40:00

That’s less code and less weirdness, so that’s good.

But to get clean data, I’ve installed a library and created a new field so I can now start to work with my dates. That seems like a lot, but don’t think your data will always be perfect and you won’t have to do these things.

Still, there’s got to be a better way. And there is.

Fortunately, readr anticipates some date formattings and can automatically handle this (indeed it uses lubridate under the hood). The change in your code? You just use read_csv instead of read.csv

tickets <- read_csv("data/tickets.csv")

Rows: 161315 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): Citation, Location, Violation
dttm (1): Date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(tickets)

# A tibble: 6 × 4
  Citation Date                Location        Violation                   
  <chr>    <dttm>              <chr>           <chr>                       
1 15078429 2012-04-02 07:15:00 North Stadium   Expired Meter               
2 24048318 2012-04-02 07:22:00 Housing         No Valid Permit Displayed   
3 24048320 2012-04-02 07:26:00 14th & W Street No Valid Permit Displayed   
4 15078430 2012-04-02 07:36:00 Champions Club  Parking in Unauthorized Area
5 18074937 2012-04-02 07:39:00 Sandoz          Expired Meter               
6 18074938 2012-04-02 07:40:00 Sandoz          Expired Meter

And just like that, the dates are formatted correctly.

But you’re not done with lubridate yet. It has some interesting pieces parts we’ll use elsewhere.

What’s a question you might have about parking tickets on campus involving dates?

How about what month are the most tickets issued? We could use formatting to create a Month field but that would group all the Aprils ever together. We could create a year and a month together, but that would give us an invalid date object and that would create problems later. Lubridate has something called a floor date that we can use.

So to follow along here, we’re going to use mutate to create a month field, group by to lump them together, summarize to count them up and arrange to order them. We’re just chaining things together.

tickets |> 
  mutate(Month = floor_date(Date, "month")) |> 
  group_by(Month) |> 
  summarise(total = n()) |>
  arrange(desc(total))

# A tibble: 56 × 2
   Month               total
   <dttm>              <int>
 1 2014-10-01 00:00:00  5177
 2 2015-04-01 00:00:00  4913
 3 2014-09-01 00:00:00  4645
 4 2015-09-01 00:00:00  4541
 5 2015-10-01 00:00:00  4403
 6 2015-03-01 00:00:00  4392
 7 2016-02-01 00:00:00  4314
 8 2016-09-01 00:00:00  4221
 9 2016-03-01 00:00:00  4194
10 2012-10-01 00:00:00  4173
# ℹ 46 more rows

So the most tickets in this dataset were issued in September of 2014. April of 2015 was second. Then two Septembers and an October.

Any guesses why those months?

I’ll give you a hint. It involves 90,000 people gathering in a big building on campus in the fall and one day in April or late March every spring.