16  Stacked bar charts

One of the elements of data visualization excellence is inviting comparison. Often that means showing what proportion a part is of the whole. A plain bar chart shows the magnitude of the whole thing. If we also have information about the parts of the whole, we can stack them on top of each other, showing both the whole and its components. And it’s a simple change to what we’ve already done.

We’re going to use a dataset of college basketball games from this past season.

Load the tidyverse.

library(tidyverse)

And the data.

games <- read_csv("data/logs22.csv")
Rows: 10775 Columns: 48
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (8): Season, TeamFull, Opponent, HomeAway, W_L, URL, Conference, Team
dbl  (39): Game, TeamScore, OpponentScore, TeamFG, TeamFGA, TeamFGPCT, Team3...
date  (1): Date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

What we have here is every game in college basketball this season. The question we want to answer is this: Who were the best rebounding teams in the Big Ten? And how did they get there?

To make this chart, we just have to add one thing to a bar chart like the ones we made in the previous chapter. Getting the data into shape, however, takes a few more steps.

We have game data, and we need season data. To get that, we need to do some group by and sum work. And since we’re only interested in the Big Ten, we have some filtering to do too. For this, we’re going to measure rebounding by offensive rebounds and defensive rebounds. So if we have all the games a team played, and the offensive and total rebounds for each of those games, we can get the season totals by adding them up. Defensive rebounds are then just total rebounds minus offensive rebounds.

games |> 
  group_by(Conference, Team) |> 
  summarise(
    SeasonOffRebounds = sum(TeamOffRebounds),
    SeasonTotalRebounds = sum(TeamTotalRebounds)
  ) |>
  mutate(
    SeasonDefRebounds = SeasonTotalRebounds - SeasonOffRebounds
  ) |> 
  select(
    -SeasonTotalRebounds
  ) |> 
  filter(Conference == "Big Ten")
# A tibble: 14 × 4
# Groups:   Conference [1]
   Conference Team           SeasonOffRebounds SeasonDefRebounds
   <chr>      <chr>                      <dbl>             <dbl>
 1 Big Ten    Illinois                     300               764
 2 Big Ten    Indiana                      228               770
 3 Big Ten    Iowa                         333               742
 4 Big Ten    Maryland                     256               767
 5 Big Ten    Michigan                     265               721
 6 Big Ten    Michigan State               268               774
 7 Big Ten    Minnesota                    132               674
 8 Big Ten    Nebraska                     196               762
 9 Big Ten    Northwestern                 226               715
10 Big Ten    Ohio State                   225               706
11 Big Ten    Penn State                   224               707
12 Big Ten    Purdue                       295               794
13 Big Ten    Rutgers                      267               715
14 Big Ten    Wisconsin                    240               730

By looking at this, we can see we got what we needed. We have 14 teams and numbers that look like season totals for two types of rebounds. Save that to a new dataframe.

games |> 
  group_by(Conference, Team) |> 
  summarise(
    SeasonOffRebounds = sum(TeamOffRebounds),
    SeasonTotalRebounds = sum(TeamTotalRebounds)
  ) |>
  mutate(
    SeasonDefRebounds = SeasonTotalRebounds - SeasonOffRebounds
  ) |> 
  select(
    -SeasonTotalRebounds
  ) |> 
  filter(Conference == "Big Ten") -> rebounds

Now, the problem we have is that ggplot wants long data and this data is wide. So we need to use tidyr to make it long, just like we did in the transforming data chapter.

rebounds |> 
  pivot_longer(
    cols=starts_with("Season"), 
    names_to="Type", 
    values_to="Rebounds")
# A tibble: 28 × 4
# Groups:   Conference [1]
   Conference Team     Type              Rebounds
   <chr>      <chr>    <chr>                <dbl>
 1 Big Ten    Illinois SeasonOffRebounds      300
 2 Big Ten    Illinois SeasonDefRebounds      764
 3 Big Ten    Indiana  SeasonOffRebounds      228
 4 Big Ten    Indiana  SeasonDefRebounds      770
 5 Big Ten    Iowa     SeasonOffRebounds      333
 6 Big Ten    Iowa     SeasonDefRebounds      742
 7 Big Ten    Maryland SeasonOffRebounds      256
 8 Big Ten    Maryland SeasonDefRebounds      767
 9 Big Ten    Michigan SeasonOffRebounds      265
10 Big Ten    Michigan SeasonDefRebounds      721
# ℹ 18 more rows

What you can see now is that we have two rows for each team: One for offensive rebounds, one for defensive rebounds. This is what ggplot needs. Save it to a new dataframe.

rebounds |> 
  pivot_longer(
    cols=starts_with("Season"), 
    names_to="Type", 
    values_to="Rebounds") -> reboundswide
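
If you want to convince yourself that the reshaping didn't lose anything, a quick check on a small sample helps. This is a minimal sketch, not part of the original walkthrough: the names `rebounds_sample` and `rebounds_sample_long` are made up for illustration, and the three rows are copied by hand from the season totals printed earlier so the code runs without the CSV file.

```r
library(tidyverse)

# Three Big Ten rows copied from the season totals shown earlier.
# rebounds_sample is a hypothetical name used only for this sketch.
rebounds_sample <- tibble(
  Team = c("Illinois", "Indiana", "Iowa"),
  SeasonOffRebounds = c(300, 228, 333),
  SeasonDefRebounds = c(764, 770, 742)
)

# Same pivot as in the walkthrough: one row per team per rebound type.
rebounds_sample_long <- rebounds_sample |>
  pivot_longer(
    cols = starts_with("Season"),
    names_to = "Type",
    values_to = "Rebounds"
  )

# Two measurement columns, so the long version should have twice the rows.
nrow(rebounds_sample_long) == 2 * nrow(rebounds_sample)

# Reshaping should not change the total number of rebounds.
sum(rebounds_sample_long$Rebounds) ==
  sum(rebounds_sample$SeasonOffRebounds) + sum(rebounds_sample$SeasonDefRebounds)
```

Both comparisons should come back `TRUE`. The same two checks work on the full `rebounds` dataframe and its long version.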

Building on what we learned in the last chapter, we know we can turn this into a bar chart with an x value, a weight and a geom_bar. What we are going to add is a fill. The fill will stack bars on each other based on which element it is. In this case, we can fill the bar by Type, which means it will stack each team’s offensive and defensive rebounds in a single bar and we can see how they compare.

ggplot() + 
  geom_bar(
    data=reboundswide, 
    aes(x=Team, weight=Rebounds, fill=Type)) + 
  coord_flip()

What’s the problem with this chart?

There are a couple of things, one of which we’ll deal with now: The ordering is alphabetical (from the bottom up). So let’s reorder the teams by Rebounds.

ggplot() + 
  geom_bar(
    data=reboundswide, 
    aes(x=reorder(Team, Rebounds), 
        weight=Rebounds, 
        fill=Type)) + 
  coord_flip()

And just like that … Purdue comes out on top? Huh. And look who is not last.
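
One detail worth knowing about the reordering: each team now has two rows, so `reorder()` has to collapse two Rebounds values into one score per team. By default it uses the mean. Because every team has exactly two rows here, ordering by the mean gives the same order as ordering by the total, but you can ask for the sum explicitly with `FUN = sum`. The sketch below is an illustration under those assumptions, not part of the original walkthrough: `sample_long`, `by_mean` and `by_sum` are made-up names, and the three teams' values are copied by hand from the season totals printed earlier.

```r
library(tidyverse)

# Three teams in long form, with values copied from the season totals table.
# sample_long is a hypothetical name used only for this sketch.
sample_long <- tibble(
  Team = rep(c("Illinois", "Minnesota", "Purdue"), each = 2),
  Type = rep(c("SeasonOffRebounds", "SeasonDefRebounds"), times = 3),
  Rebounds = c(300, 764, 132, 674, 295, 794)
)

# Default behavior: teams are ordered by the mean of their two Rebounds rows.
by_mean <- reorder(sample_long$Team, sample_long$Rebounds)

# Explicit alternative: teams are ordered by their total rebounds.
by_sum <- reorder(sample_long$Team, sample_long$Rebounds, FUN = sum)

# Levels run from the smallest score to the largest.
levels(by_mean)
levels(by_sum)
```

Both calls put the teams in the same order here. With `coord_flip()`, the last level is drawn at the top of the chart, which is why the team with the most rebounds ends up there.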