35 Intro to rvest

All the way back in Chapter 2, we used Google Sheets and importHTML to get our own data out of a website. For me, that’s a lot of pointing and clicking and copying and pasting. R has a library that can automate the harvesting of data from HTML on the internet. It’s called rvest.

Let’s grab a simple, basic HTML table from College Football Stats. There’s nothing particularly strange about this table – it’s simply formatted and easy to scrape.

First we’ll need some libraries. We’re going to use a library called rvest, which you can get by running install.packages('rvest') in the console.

library(rvest)
library(tidyverse)

The rvest package has functions that make fetching, reading and parsing HTML simple. The first thing we need to do is specify a url that we’re going to scrape.

scoringoffenseurl <- "http://www.cfbstats.com/2021/leader/national/team/offense/split01/category09/sort01.html"

Now, the most difficult part of scraping data from any website is knowing what exact HTML tag you need to grab. In this case, we want a <table> tag that has all of our data table in it. But how do you tell R which one that is? Well, it’s easy, once you know what to do. But it’s not simple. So I’ve made a short video to show you how to find it.

When you have simple tables, the code is very simple. You create a variable to receive the data, then pass it the url, read the html that was fetched, find the node you need using your XPath value you just copied and you tell rvest that it’s a table.

scoringoffense <- scoringoffenseurl |>
  read_html() |>
  html_nodes(xpath = '//*[@id="content"]/div[2]/table') |>
  html_table()

What we get from this is … not a dataframe. It’s a list with one element in it, which just so happens to be our dataframe. When you get this, the solution is simple: just overwrite the variable you created with the first list element.

scoringoffense <- scoringoffense[[1]]

And what do we have?

head(scoringoffense)

# A tibble: 6 × 10
     `` Name                 G    TD    FG `1XP` `2XP` Safety Points `Points/G`
  <int> <chr>            <int> <int> <int> <int> <int>  <int>  <int>      <dbl>
1     1 Ohio State          13    76    20    76     0      1    594       45.7
2     2 Western Kentucky    14    79    23    72     2      0    619       44.2
3     3 Pittsburgh          14    76    17    71     0      1    580       41.4
4     4 Wake Forest         14    72    23    65     4      0    574       41  
5     5 Coastal Carolina    13    72     9    65     2      2    532       40.9
6     6 Alabama             15    75    23    71     4      0    598       39.9

We have data, ready for analysis.

35.1 A slightly more complicated example

What if we want more than one year in our dataframe?

This is a common problem. What if we want to look at every scoring offense going back several years? The website has them going back to 2009. How can we combine them?

First, we should note, that the data does not have anything in it to indicate what year it comes from. So we’re going to have to add that. And we’re going to have to figure out a way to stack two dataframes on top of each other.

So let’s grab 2020.

scoringoffenseurl20 <- "http://www.cfbstats.com/2020/leader/national/team/offense/split01/category09/sort01.html"

scoringoffense20 <- scoringoffenseurl20 |>
  read_html() |>
  html_nodes(xpath = '//*[@id="content"]/div[2]/table') |>
  html_table()

scoringoffense20 <- scoringoffense20[[1]]

First, how are we going to know, in the data, which year our data is from? We can use mutate.

scoringoffense20 <- scoringoffense |> mutate(YEAR = 2020)

Error in `mutate()`:
! Can't transform a data frame with `NA` or `""` names.

Uh oh. Error. What does it say? It’s … not clear, but a hint is that our first column doesn’t have a name. Each column must be named. If you look at our data in the environment tab in the upper right corner, you’ll see that indeed, the first column has no name. It’s the FBS rank of each team. So we can fix that and mutate in the same step. We’ll do that using rename and since the field doesn’t have a name to rename it, we’ll use a position argument. We’ll say rename column 1 as Rank.

scoringoffense21 <- scoringoffense |> rename(Rank = 1) |> mutate(YEAR = 2021)
scoringoffense20 <- scoringoffense20 |> rename(Rank = 1) |> mutate(YEAR = 2020)

And now, to combine the two tables together length-wise – we need to make long data – we’ll use a dpylr function called bind_rows. The good thing is bind_rows is simple.

combined <- bind_rows(scoringoffense21, scoringoffense20)

Note in the environment tab we now have a data frame called combined that has 257 observations – which just so happens to be what 127 from 2020 and 130 from 2021 add up to.

head(combined)

# A tibble: 6 × 11
   Rank Name            G    TD    FG `1XP` `2XP` Safety Points `Points/G`  YEAR
  <int> <chr>       <int> <int> <int> <int> <int>  <int>  <int>      <dbl> <dbl>
1     1 Ohio State     13    76    20    76     0      1    594       45.7  2021
2     2 Western Ke…    14    79    23    72     2      0    619       44.2  2021
3     3 Pittsburgh     14    76    17    71     0      1    580       41.4  2021
4     4 Wake Forest    14    72    23    65     4      0    574       41    2021
5     5 Coastal Ca…    13    72     9    65     2      2    532       40.9  2021
6     6 Alabama        15    75    23    71     4      0    598       39.9  2021

35.2 An even more complicated example

What do you do when the table has non-standard headers?

Unfortunately, non-standard means there’s no one way to do it – it’s going to depend on the table and the headers. But here’s one idea: Don’t try to make it work.

I’ll explain.

Let’s try to get season team stats from Sports Reference. If you look at that page, you’ll see the problem right away – the headers span two rows, and they repeat. That’s going to be all kinds of no good. You can’t import that. Dataframes must have names all in one row. If you have two-line headers, you have a problem you have to fix before you can do anything else with it.

First we’ll grab the page.

url <- "https://www.sports-reference.com/cbb/seasons/2020-school-stats.html"

Now, similar to our example above, we’ll read the html, use XPath to find the table, and then read that table with a directive passed to it setting the header to FALSE. That tells rvest that there isn’t a header row. Just import it as data.

stats <- url |>
  read_html() |>
  html_nodes(xpath = '//*[@id="basic_school_stats"]') |>
  html_table(header=FALSE)

What we get back is a list of one element (similar to above). So let’s pop it out into a data frame.

stats <- stats[[1]]

And we’ll take a look at what we have.

head(stats)

# A tibble: 6 × 38
  X1    X2     X3    X4    X5    X6    X7    X8    X9    X10   X11   X12   X13  
  <chr> <chr>  <chr> <chr> <chr> <chr> <chr> <chr> <lgl> <chr> <chr> <lgl> <chr>
1 ""    ""     Over… Over… Over… Over… Over… Over… NA    Conf. Conf. NA    Home 
2 "Rk"  "Scho… G     W     L     W-L%  SRS   SOS   NA    W     L     NA    W    
3 "1"   "Abil… 31    20    11    .645  -2.87 -6.87 NA    15    5     NA    13   
4 "2"   "Air … 32    12    20    .375  -0.37 3.02  NA    5     13    NA    8    
5 "3"   "Akro… 31    24    7     .774  7.15  -0.40 NA    14    4     NA    15   
6 "4"   "Alab… 31    16    15    .516  11.12 8.12  NA    8     10    NA    10   
# ℹ 25 more variables: X14 <chr>, X15 <lgl>, X16 <chr>, X17 <chr>, X18 <lgl>,
#   X19 <chr>, X20 <chr>, X21 <lgl>, X22 <chr>, X23 <chr>, X24 <chr>,
#   X25 <chr>, X26 <chr>, X27 <chr>, X28 <chr>, X29 <chr>, X30 <chr>,
#   X31 <chr>, X32 <chr>, X33 <chr>, X34 <chr>, X35 <chr>, X36 <chr>,
#   X37 <chr>, X38 <chr>

So, that’s not ideal. We have headers and data mixed together, and our columns are named X1 to X38. Also note: They’re all character fields. Because the headers are interspersed with data, it all gets called character data. So we’ve got to first rename each field.

stats <- stats |> rename(Rank=X1, School=X2, Games=X3, OverallWins=X4, OverallLosses=X5, WinPct=X6, OverallSRS=X7, OverallSOS=X8, Blank1=X9, ConferenceWins=X10, ConferenceLosses=X11, Blank2=X12, HomeWins=X13, HomeLosses=X14, Blank3=X15, AwayWins=X16, AwayLosses=X17, Blank4=X18, ForPoints=X19, OppPoints=X20, Blank5=X21, Minutes=X22, FieldGoalsMade=X23, FieldGoalsAttempted=X24, FieldGoalPCT=X25, ThreePointMade=X26, ThreePointAttempts=X27, ThreePointPct=X28, FreeThrowsMade=X29, FreeThrowsAttempted=X30, FreeThrowPCT=X31, OffensiveRebounds=X32, TotalRebounds=X33, Assists=X34, Steals=X35, Blocks=X36, Turnovers=X37, PersonalFouls=X38)

Now we have to get rid of those headers interspersed in the data. We can do that with filter that say keep all the stuff that isn’t this.

stats <- stats |> filter(Rank != "Rk" & Games != "Overall")

And finally, we need to change the file type of all the fields that need it. We’re going to use a clever little trick, which goes like this: We’re going to use mutate_at, which means mutate these fields. The pattern for mutate_at is mutate_at these variables and do this thing to them. But instead of specifying which of 38 variables we’re going to mutate, we’re going to specify the one we don’t want to change, which is the name of the school. And we just want to convert them to numeric, which is simple. Here’s what it looks like:

stats <- stats |> mutate_at(vars(-School), as.numeric)

One last thing: Who needs columns called Blank1, Blank2, Blank3, etc?

stats <- stats |> select(-starts_with("Blank"))

And just like that, we have a method for getting up to the minute season stats for every team in Division I.

head(stats)

# A tibble: 6 × 33
   Rank School      Games OverallWins OverallLosses WinPct OverallSRS OverallSOS
  <dbl> <chr>       <dbl>       <dbl>         <dbl>  <dbl>      <dbl>      <dbl>
1     1 Abilene Ch…    31          20            11  0.645      -2.87      -6.87
2     2 Air Force      32          12            20  0.375      -0.37       3.02
3     3 Akron          31          24             7  0.774       7.15      -0.4 
4     4 Alabama        31          16            15  0.516      11.1        8.12
5     5 Alabama A&M    30           8            22  0.267     -18.9       -8.85
6     6 Alabama St…    32           8            24  0.25      -16.2       -6.53
# ℹ 25 more variables: ConferenceWins <dbl>, ConferenceLosses <dbl>,
#   HomeWins <dbl>, HomeLosses <dbl>, AwayWins <dbl>, AwayLosses <dbl>,
#   ForPoints <dbl>, OppPoints <dbl>, Minutes <dbl>, FieldGoalsMade <dbl>,
#   FieldGoalsAttempted <dbl>, FieldGoalPCT <dbl>, ThreePointMade <dbl>,
#   ThreePointAttempts <dbl>, ThreePointPct <dbl>, FreeThrowsMade <dbl>,
#   FreeThrowsAttempted <dbl>, FreeThrowPCT <dbl>, OffensiveRebounds <dbl>,
#   TotalRebounds <dbl>, Assists <dbl>, Steals <dbl>, Blocks <dbl>, …