Sometimes, governments put data online on a page or in a searchable database. And when you ask them for a copy of the data underneath the website, they say no. Why? Because they have a website. That’s it. That’s their reason. We don’t have to give you the data because we’ve already given you the data, never mind that they haven’t. One of the most powerful tools you can learn as a data journalist is how to scrape data. Scraping is the process of programming a computer to act like a browser, go to a website, ingest the HTML from that website and turn it into data.
The degree of difficulty here goes from Easy to So Hard You Want To Throw Your Laptop Out A Window. And the curve between the two is steep. You can learn how to scrape the Easy ones in a day. The hard ones take months to years of programming experience.
So.
Let’s try an easy one.
We’re going to use a library called rvest, which you can install the same way we’ve done all installs: go to the console and run install.packages("rvest").
We’ll load them first:
library(rvest)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.2 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ readr::guess_encoding() masks rvest::guess_encoding()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
The first thing we need to do is define a URL. What URL are we going to scrape? This is where paying attention to URLs pays off. Some search URLs are addressable – meaning you can copy the URL of your search and go to it again and again. Others hide the search term, so the URL alone won’t take you back to your results.
Let’s take an example from the Nebraska Legislature. The Legislature publishes a daily agenda that tells you what bills will be debated on the floor that day. Here’s an example from Feb. 3. Go to that page. You’ll see multiple sections – a reports section, and the General File section. General File is the first stage of floor debate. Those are the bills to be debated.
Let’s grab them with rvest.
First, we create a url variable and set it equal to that url.
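In code, that’s one line. The exact address below is an assumption – copy the real one for the Feb. 3 agenda out of your own browser’s address bar:

```r
# A placeholder URL for the Feb. 3 daily agenda page.
# The exact path is an assumption; use the address from your browser.
url <- "https://nebraskalegislature.gov/calendar/dayagenda.php?day=2020-02-03"
```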
Now we’re going to do a handful of things at once. We’re going to take that url, pass it to a read_html command, which does what you think it does. We’re then going to search that HTML for a specific node, the node that contains our data.
The most difficult part of scraping data from any website is knowing what exact HTML tag you need to grab. In this case, we want a <table> tag that has all of our data table in it. But how do you tell R which one that is? Well, it’s easy, once you know what to do. But it’s not simple. So I’ve made a short video to show you how to find it.
Using that same trick on the Legislature page, we find the table with those general file bills and it’s in something called agenda_table_4464. With that, we can turn it into a table.
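Put together, the whole step looks something like this sketch. The id agenda_table_4464 is the one we found with the inspector trick; everything else is standard rvest (older tutorials use html_nodes, which still works, but html_elements is the current name):

```r
library(rvest)

# read_html fetches the page's HTML; html_elements finds the node
# with our table's id; html_table converts <table> nodes into a
# list of dataframes. url is the variable we defined above.
agenda <- url |>
  read_html() |>
  html_elements("#agenda_table_4464") |>
  html_table()
```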
After doing that, look in the environment pane for your agenda. You’ll see … not a dataframe. You’ll see a list. With one thing in it. That one thing? A dataframe. So we need to grab that first element. We get it by doing this:
agenda <- agenda[[1]]
And now we can take a look:
agenda
# A tibble: 10 × 3
Document Introducer Description
<chr> <chr> <chr>
1 "Currently or Pendingon Floor\n … Bolz Provide a …
2 "Currently or Pendingon Floor\n … Kolterman Adopt the …
3 "Currently or Pendingon Floor\n … Kolterman Appropriat…
4 "Currently or Pendingon Floor\n … Bolz Change eli…
5 "Currently or Pendingon Floor\n … Kolterman Change pro…
6 "Currently or Pendingon Floor\n … Kolterman Appropriat…
7 "Currently or Pendingon Floor\n … Dorn Change pro…
8 "Currently or Pendingon Floor\n … Wishart Change pro…
9 "Currently or Pendingon Floor\n … McDonnell Change pro…
10 "Currently or Pendingon Floor\n … Vargas Change pro…
And as you can see, it’s … not perfect. But we can fix that with a little gsub magic.
agenda |>
  mutate(Document = gsub("Currently or Pendingon Floor\n ", "", Document))
The Nebraska Legislature, with its unique unicameral, or one-house, structure, does some things a little differently than other legislatures. Example: bills have to go through three rounds of debate before getting passed. There’s General File (round 1), Select File (round 2), and Final Reading (round 3).
So what does a day where they do more than general file look like? Like this.
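The scrape works the same way, except we grab every table on the page instead of one specific id. A minimal sketch, assuming harderurl holds that day’s agenda address:

```r
library(rvest)

# harderurl is a placeholder for the busier day's agenda URL.
# Grabbing every <table> on the page returns a list of dataframes,
# one per table.
harderagenda <- harderurl |>
  read_html() |>
  html_elements("table") |>
  html_table()
```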
You can see that harderagenda is a list of four instead of a list of one. Each item in the list is a dataframe. We can look at them individually. Here’s the first:
harderagenda[[1]]
# A tibble: 4 × 3
Document Introducer Description
<chr> <chr> <chr>
1 "Currently or Pendingon Floor\n … Kolterman Define the…
2 "Currently or Pendingon Floor\n … Geist Change and…
3 "Currently or Pendingon Floor\n … Chambers Change pro…
4 "Currently or Pendingon Floor\n … Gragert Provide fo…
Here’s the second:
harderagenda[[2]]
# A tibble: 4 × 3
Document Introducer Description
<chr> <chr> <chr>
1 "Currently or Pendingon Floor\n … Crawford Change the…
2 "Currently or Pendingon Floor\n … Lindstrom Change pro…
3 "Currently or Pendingon Floor\n … Quick Change the…
4 "Currently or Pendingon Floor\n … Hunt Adopt the …
So we can merge those together no problem, but how would we know what stage each bill is at?
Look at the page – we can see that the bills are separated by a big headline like “SELECT FILE: 2020 PRIORITY BILLS”. To separate these, we need to grab those and then add them to each bill using mutate.
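Grabbing those headlines works just like grabbing the tables – find the right node, then pull out its text. The CSS selector below is a hypothetical stand-in; use the inspector trick from the video to find the tag that actually wraps the headlines on the page:

```r
library(rvest)

# ".card-header" is an assumed selector for the big section headlines;
# inspect the page to find the real one. html_text pulls the raw text
# out of each matching node, giving us a list of labels.
labels <- harderurl |>
  read_html() |>
  html_elements(".card-header") |>
  html_text()
```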
Another list. If you look at the first, it’s at the top of the page with no bills. Here’s the second:
labels[[2]]
[1] "\n SELECT FILE: 2020 PRIORITY BILLS\n "
So we can see there’s some garbage in there we want to clean out. We can use a library called stringr – already loaded as part of the tidyverse – to trim the excess spaces, and gsub to strip the newline character: \n.
harderagenda[[1]] |>
  mutate(Document = gsub("Currently or Pendingon Floor\n ", "", Document)) |>
  mutate(Stage = labels[[2]]) |>
  mutate(Stage = gsub("\n", "", Stage)) |>
  mutate(Stage = str_trim(Stage, side = "both"))
Now it’s just a matter of grinding through the items in the list.
NOTE: This is grossly inefficient and very manual. And, we’d have to change this for every day we want to scrape. As such, this is not the “right” way to do this. We’ll cover that in the next chapter.
harderagenda1 <- harderagenda[[1]] |>
  mutate(Document = gsub("Currently or Pendingon Floor\n ", "", Document)) |>
  mutate(Stage = labels[[2]]) |>
  mutate(Stage = gsub("\n", "", Stage)) |>
  mutate(Stage = str_trim(Stage, side = "both"))

harderagenda2 <- harderagenda[[2]] |>
  mutate(Document = gsub("Currently or Pendingon Floor\n ", "", Document)) |>
  mutate(Stage = labels[[3]]) |>
  mutate(Stage = gsub("\n", "", Stage)) |>
  mutate(Stage = str_trim(Stage, side = "both"))

harderagenda3 <- harderagenda[[3]] |>
  mutate(Document = gsub("Currently or Pendingon Floor\n ", "", Document)) |>
  mutate(Stage = labels[[4]]) |>
  mutate(Stage = gsub("\n", "", Stage)) |>
  mutate(Stage = str_trim(Stage, side = "both"))

harderagenda4 <- harderagenda[[4]] |>
  mutate(Document = gsub("Currently or Pendingon Floor\n ", "", Document)) |>
  mutate(Stage = labels[[5]]) |>
  mutate(Stage = gsub("\n", "", Stage)) |>
  mutate(Stage = str_trim(Stage, side = "both"))
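With all four pieces carrying a Stage column, stacking them into one dataframe is a single step – every piece now has the same columns, so dplyr’s bind_rows can glue them together:

```r
library(dplyr)

# bind_rows stacks dataframes with matching columns on top of each other,
# giving one dataframe of every bill on the agenda, labeled by stage.
harderagenda <- bind_rows(harderagenda1, harderagenda2, harderagenda3, harderagenda4)
```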