Sometimes, governments put data online on a page or in a searchable database. And when you ask them for a copy of the data underneath the website, they say no. Why? Because they have a website. That’s it. That’s their reason. We don’t have to give you the data because we’ve already given you the data, never mind that they haven’t. One of the most powerful tools you can learn as a data journalist is how to scrape data. Scraping is the process of programming a computer to act like a browser, go to a website, ingest the HTML from that website and turn it into data.
The degree of difficulty here goes from Easy to So Hard You Want To Throw Your Laptop Out A Window. And the curve between the two is steep. You can learn how to scrape the Easy ones in a day. The hard ones take months to years of programming experience.
So.
Let’s try an easy one.
We’re going to use a library called rvest, which you can install the same way we’ve done all installs: go to the console and run install.packages("rvest").
We’ll load them first:
library(rvest)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.2 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ readr::guess_encoding() masks rvest::guess_encoding()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
The first thing we need to do is define a URL. What URL are we going to scrape? This is where paying attention to URLs pays off. Some search URLs are addressable – meaning you can copy the URL of your search and go to it again and again. Others hide the search term, so the URL alone won’t take you back to your results.
Let’s take an example from the Nebraska Legislature. The Legislature publishes a daily agenda that tells you what bills will be debated on the floor that day. Here’s an example from Feb. 3. Go to that page. You’ll see multiple sections – a reports section, and the General File section. General File is the first stage of floor debate. Those are the bills to be debated.
Let’s grab them with rvest.
First, we create a url variable and set it equal to that url.
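In code, that’s one line. The exact address below is an assumption – copy the real one for the Feb. 3 agenda out of your own browser’s address bar:

```r
# A placeholder URL for the Feb. 3 daily agenda page.
# The exact path is an assumption; use the address from your browser.
url <- "https://nebraskalegislature.gov/calendar/dayagenda.php?day=2020-02-03"
```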
Now we’re going to do a handful of things at once. We’re going to take that url, pass it to a read_html command, which does what you think it does. We’re then going to search that HTML for a specific node, the node that contains our data.
The most difficult part of scraping data from any website is knowing what exact HTML tag you need to grab. In this case, we want a <table> tag that has all of our data table in it. But how do you tell R which one that is? Well, it’s easy, once you know what to do. But it’s not simple. So I’ve made a short video to show you how to find it.
Using that same trick on the Legislature page, we find the table with those general file bills and it’s in something called agenda_table_4464. With that, we can turn it into a table.
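Put together, the whole step looks something like this sketch. The id agenda_table_4464 is the one we found with the inspector trick; everything else is standard rvest (older tutorials use html_nodes, which still works, but html_elements is the current name):

```r
library(rvest)

# read_html fetches the page's HTML; html_elements finds the node
# with our table's id; html_table converts <table> nodes into a
# list of dataframes. url is the variable we defined above.
agenda <- url |>
  read_html() |>
  html_elements("#agenda_table_4464") |>
  html_table()
```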
After doing that, look in the environment pane for your agenda. You’ll see … not a dataframe. You’ll see a list. With one thing in it. That one thing? A dataframe. So we need to grab that first element. We get it by doing this:
agenda <- agenda[[1]]
And now we can take a look:
agenda
# A tibble: 10 × 3
Document Introducer Description
<chr> <chr> <chr>
1 "Currently or Pendingon Floor\n … Bolz Provide a …
2 "Currently or Pendingon Floor\n … Kolterman Adopt the …
3 "Currently or Pendingon Floor\n … Kolterman Appropriat…
4 "Currently or Pendingon Floor\n … Bolz Change eli…
5 "Currently or Pendingon Floor\n … Kolterman Change pro…
6 "Currently or Pendingon Floor\n … Kolterman Appropriat…
7 "Currently or Pendingon Floor\n … Dorn Change pro…
8 "Currently or Pendingon Floor\n … Wishart Change pro…
9 "Currently or Pendingon Floor\n … McDonnell Change pro…
10 "Currently or Pendingon Floor\n … Vargas Change pro…
And as you can see, it’s … not perfect. But we can fix that with a little gsub magic.
agenda |>
  mutate(Document = gsub("Currently or Pendingon Floor\n ", "", Document))
The Nebraska Legislature, with its unique unicameral, or one-house, structure, does some things a little differently than other legislatures. Example: bills have to go through three rounds of debate before getting passed. There’s General File (round 1), Select File (round 2), and Final Reading (round 3).
So what does a day where they do more than general file look like? Like this.
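The scrape works the same way, except we grab every table on the page instead of one specific id. A minimal sketch, assuming harderurl holds that day’s agenda address:

```r
library(rvest)

# harderurl is a placeholder for the busier day's agenda URL.
# Grabbing every <table> on the page returns a list of dataframes,
# one per table.
harderagenda <- harderurl |>
  read_html() |>
  html_elements("table") |>
  html_table()
```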
You can see that harderagenda is a list of four instead of a list of one. Each item in the list is a dataframe. We can look at them individually. Here’s the first:
harderagenda[[1]]
# A tibble: 4 × 3
Document Introducer Description
<chr> <chr> <chr>
1 "Currently or Pendingon Floor\n … Kolterman Define the…
2 "Currently or Pendingon Floor\n … Geist Change and…
3 "Currently or Pendingon Floor\n … Chambers Change pro…
4 "Currently or Pendingon Floor\n … Gragert Provide fo…
Here’s the second:
harderagenda[[2]]
# A tibble: 4 × 3
Document Introducer Description
<chr> <chr> <chr>
1 "Currently or Pendingon Floor\n … Crawford Change the…
2 "Currently or Pendingon Floor\n … Lindstrom Change pro…
3 "Currently or Pendingon Floor\n … Quick Change the…
4 "Currently or Pendingon Floor\n … Hunt Adopt the …
So we can merge those together no problem, but how would we know what stage each bill is at?
Look at the page – we can see that the bills are separated by a big headline like “SELECT FILE: 2020 PRIORITY BILLS”. To separate these, we need to grab those and then add them to each bill using mutate.
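Grabbing those headlines works just like grabbing the tables – find the right node, then pull out its text. The CSS selector below is a hypothetical stand-in; use the inspector trick from the video to find the tag that actually wraps the headlines on the page:

```r
library(rvest)

# ".card-header" is an assumed selector for the big section headlines;
# inspect the page to find the real one. html_text pulls the raw text
# out of each matching node, giving us a list of labels.
labels <- harderurl |>
  read_html() |>
  html_elements(".card-header") |>
  html_text()
```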
Another list. If you look at the first, it’s at the top of the page with no bills. Here’s the second:
labels[[2]]
[1] "\n SELECT FILE: 2020 PRIORITY BILLS\n "
So we can see there’s some garbage in there we want to clean out. We can use a library called stringr – already loaded as part of the tidyverse – to trim the excess spaces, and gsub to strip the newline character: \n.
harderagenda[[1]] |>
  mutate(Document = gsub("Currently or Pendingon Floor\n ", "", Document)) |>
  mutate(Stage = labels[[2]]) |>
  mutate(Stage = gsub("\n", "", Stage)) |>
  mutate(Stage = str_trim(Stage, side = "both"))
Now it’s just a matter of grinding through the items in the list.
NOTE: This is grossly inefficient and very manual. And, we’d have to change this for every day we want to scrape. As such, this is not the “right” way to do this. We’ll cover that in the next chapter.
harderagenda1 <- harderagenda[[1]] |>
  mutate(Document = gsub("Currently or Pendingon Floor\n ", "", Document)) |>
  mutate(Stage = labels[[2]]) |>
  mutate(Stage = gsub("\n", "", Stage)) |>
  mutate(Stage = str_trim(Stage, side = "both"))

harderagenda2 <- harderagenda[[2]] |>
  mutate(Document = gsub("Currently or Pendingon Floor\n ", "", Document)) |>
  mutate(Stage = labels[[3]]) |>
  mutate(Stage = gsub("\n", "", Stage)) |>
  mutate(Stage = str_trim(Stage, side = "both"))

harderagenda3 <- harderagenda[[3]] |>
  mutate(Document = gsub("Currently or Pendingon Floor\n ", "", Document)) |>
  mutate(Stage = labels[[4]]) |>
  mutate(Stage = gsub("\n", "", Stage)) |>
  mutate(Stage = str_trim(Stage, side = "both"))

harderagenda4 <- harderagenda[[4]] |>
  mutate(Document = gsub("Currently or Pendingon Floor\n ", "", Document)) |>
  mutate(Stage = labels[[5]]) |>
  mutate(Stage = gsub("\n", "", Stage)) |>
  mutate(Stage = str_trim(Stage, side = "both"))
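With all four pieces carrying a Stage column, stacking them into one dataframe is a single step – every piece now has the same columns, so dplyr’s bind_rows can glue them together:

```r
library(dplyr)

# bind_rows stacks dataframes with matching columns on top of each other,
# giving one dataframe of every bill on the agenda, labeled by stage.
harderagenda <- bind_rows(harderagenda1, harderagenda2, harderagenda3, harderagenda4)
```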