17  Visualizing your data for reporting

Visualizing data is becoming a much greater part of journalism. Large news organizations are creating graphics desks that create complex visuals with data to inform the public about important events.

To do it well is a course on it’s own. And not every story needs a feat of programming and art. Sometimes, you can help yourself and your story by just creating a quick chart.

Good news: one of the best libraries for visualizing data is in the tidyverse and it’s pretty simple to make simple charts quickly with just a little bit of code.

Let’s revisit some data we’ve used in the past and turn it into charts.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

The dataset we’ll use is the mountainlion data we looked at in Chapter 6.

mountainlions <- read_csv("data/mountainlions.csv")
Rows: 393 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Cofirm Type, COUNTY, Date
dbl (1): ID

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

17.1 Bar charts

The first kind of chart we’ll create is a simple bar chart. It’s a chart designed to show differences between things – the magnitude of one, compared to the next, and the next, and the next. So if we have thing, like a county, or a state, or a group name, and then a count of that group, we can make a bar chart.

So what does the chart of the top 10 counties with the most mountain sightings look like?

First, we’ll create a dataframe of those top 10, called topsightings.

mountainlions |>
  group_by(COUNTY) |>
  summarise(
    total = n()
  ) |> top_n(10, total) -> topsightings

Now ggplot. The first thing we do with ggplot is invoke it, which creates the canvas. In ggplot, we work with geometries – the shape that the data will take – and aesthetics – the data that will take shape. In a bar chart, we first pass in the data to the geometry, then set the aesthetic. We tell ggplot what the x value is – in a bar chart, that’s almost always your grouping variable. Then we tell it the weight of the bar – the number that will set the height.

ggplot() + geom_bar(data=topsightings, aes(x=COUNTY, weight=total))

The bars look good, but he order makes no sense. In ggplot, we use reorder, and we reorder the x value based on the weight, like this:

ggplot() + geom_bar(data=topsightings, aes(x=reorder(COUNTY, total), weight=total))

Better, but it looks … not great on the bottom. We can fix that by flipping the coordinates.

ggplot() + geom_bar(data=topsightings, aes(x=reorder(COUNTY, total), weight=total)) + coord_flip()

Art? No. Tells you the story? Yep. And for reporting purposes, that’s enough.

17.2 Line charts

Line charts show change over time. It works the much the same as a bar chart, code wise, but instead of a weight, it uses a y. And if you have more than one group in your data, it takes a group element.

The secret to knowing if you have a line chart is if you have a date. The secret to making a line chart is your x value is almost always a date.

To make this easier, I’ve created a new version of the county population estimates data that it formatted for charting. Let’s import it:

populationestimates <- read_csv("data/countypopulationslong.csv")
Rows: 28278 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (4): STNAME, CTYNAME, Year, Name
dbl  (1): Population
date (1): Date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

As you can see, I’ve switched the data from the years going wide to the right to each line being one county, one year. And, I’ve added a date column, which is the estimates date.

Now, if we tried to make a line chart of all 3,142 counties, we’d get a mess. But, it’s the first mistake people make in creating a line chart, so let’s do that.

ggplot() + geom_line(data=populationestimates, aes(x=Date, y=Population, group=Name))

And what do we learn from this? There’s one very, very big county, some less big counties, and a ton of smaller counties. So many we can’t see Them, and because the numbers are so big, any changes are dwarfed.

So let’s thin the herd here. How about we just look at Lancaster County, Nebraska.

lancaster <- populationestimates |> filter(Name == "Lancaster County, Nebraska")

Now let’s chart it.

ggplot() + geom_line(data=lancaster, aes(x=Date, y=Population, group=Name))

Growing like gangbusters, right? Well, not exactly. Note the y axis doesn’t start at 0.

ggplot() + geom_line(data=lancaster, aes(x=Date, y=Population, group=Name)) + scale_y_continuous(limits = c(0, 320000))

More accurate.

But how does that compare to Omaha? Geoms work in layers. We can just add Omaha. First, we need a dataframe with Douglas County in it.

douglas <- populationestimates |> filter(Name == "Douglas County, Nebraska")

With that, we just add another geom_line. We will also need to adjust our y axis limits to expand to fit Omaha.

ggplot() + geom_line(data=lancaster, aes(x=Date, y=Population, group=Name)) + geom_line(data=douglas, aes(x=Date, y=Population, group=Name)) + scale_y_continuous(limits = c(0, 650000))

So both counties are growing, but from the line we can see that Omaha is growing just ever so slightly faster. No accounting for taste.