library(rvest)
library(tidyverse)
35 Intro to rvest
All the way back in Chapter 2, we used Google Sheets and importHTML to get our own data out of a website. For me, that’s a lot of pointing and clicking and copying and pasting. R has a library that can automate the harvesting of data from HTML on the internet. It’s called rvest
.
Let’s grab a simple, basic HTML table from College Football Stats. There’s nothing particularly strange about this table – it’s simply formatted and easy to scrape.
First we’ll need some libraries. We’re going to use a library called rvest
, which you can get by running install.packages('rvest')
in the console.
The rvest package has functions that make fetching, reading and parsing HTML simple. The first thing we need to do is specify a url that we’re going to scrape.
<- "http://www.cfbstats.com/2021/leader/national/team/offense/split01/category09/sort01.html" scoringoffenseurl
Now, the most difficult part of scraping data from any website is knowing what exact HTML tag you need to grab. In this case, we want a <table>
tag that has all of our data table in it. But how do you tell R which one that is? Well, it’s easy, once you know what to do. But it’s not simple. So I’ve made a short video to show you how to find it.
When you have simple tables, the code is very simple. You create a variable to receive the data, then pass it the url, read the html that was fetched, find the node you need using your XPath value you just copied and you tell rvest that it’s a table.
<- scoringoffenseurl |>
scoringoffense read_html() |>
html_nodes(xpath = '//*[@id="content"]/div[2]/table') |>
html_table()
What we get from this is … not a dataframe. It’s a list with one element in it, which just so happens to be our dataframe. When you get this, the solution is simple: just overwrite the variable you created with the first list element.
<- scoringoffense[[1]] scoringoffense
And what do we have?
head(scoringoffense)
# A tibble: 6 × 10
`` Name G TD FG `1XP` `2XP` Safety Points `Points/G`
<int> <chr> <int> <int> <int> <int> <int> <int> <int> <dbl>
1 1 Ohio State 13 76 20 76 0 1 594 45.7
2 2 Western Kentucky 14 79 23 72 2 0 619 44.2
3 3 Pittsburgh 14 76 17 71 0 1 580 41.4
4 4 Wake Forest 14 72 23 65 4 0 574 41
5 5 Coastal Carolina 13 72 9 65 2 2 532 40.9
6 6 Alabama 15 75 23 71 4 0 598 39.9
We have data, ready for analysis.
35.1 A slightly more complicated example
What if we want more than one year in our dataframe?
This is a common problem. What if we want to look at every scoring offense going back several years? The website has them going back to 2009. How can we combine them?
First, we should note, that the data does not have anything in it to indicate what year it comes from. So we’re going to have to add that. And we’re going to have to figure out a way to stack two dataframes on top of each other.
So let’s grab 2020.
<- "http://www.cfbstats.com/2020/leader/national/team/offense/split01/category09/sort01.html"
scoringoffenseurl20
<- scoringoffenseurl20 |>
scoringoffense20 read_html() |>
html_nodes(xpath = '//*[@id="content"]/div[2]/table') |>
html_table()
<- scoringoffense20[[1]] scoringoffense20
First, how are we going to know, in the data, which year our data is from? We can use mutate.
<- scoringoffense |> mutate(YEAR = 2020) scoringoffense20
Error in `mutate()`:
! Can't transform a data frame with `NA` or `""` names.
Uh oh. Error. What does it say? It’s … not clear, but a hint is that our first column doesn’t have a name. Each column must be named. If you look at our data in the environment tab in the upper right corner, you’ll see that indeed, the first column has no name. It’s the FBS rank of each team. So we can fix that and mutate in the same step. We’ll do that using rename
and since the field doesn’t have a name to rename it, we’ll use a position argument. We’ll say rename column 1 as Rank.
<- scoringoffense |> rename(Rank = 1) |> mutate(YEAR = 2021)
scoringoffense21 <- scoringoffense20 |> rename(Rank = 1) |> mutate(YEAR = 2020) scoringoffense20
And now, to combine the two tables together length-wise – we need to make long data – we’ll use a dpylr function called bind_rows
. The good thing is bind_rows is simple.
<- bind_rows(scoringoffense21, scoringoffense20) combined
Note in the environment tab we now have a data frame called combined that has 257 observations – which just so happens to be what 127 from 2020 and 130 from 2021 add up to.
head(combined)
# A tibble: 6 × 11
Rank Name G TD FG `1XP` `2XP` Safety Points `Points/G` YEAR
<int> <chr> <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl>
1 1 Ohio State 13 76 20 76 0 1 594 45.7 2021
2 2 Western Ke… 14 79 23 72 2 0 619 44.2 2021
3 3 Pittsburgh 14 76 17 71 0 1 580 41.4 2021
4 4 Wake Forest 14 72 23 65 4 0 574 41 2021
5 5 Coastal Ca… 13 72 9 65 2 2 532 40.9 2021
6 6 Alabama 15 75 23 71 4 0 598 39.9 2021
35.2 An even more complicated example
What do you do when the table has non-standard headers?
Unfortunately, non-standard means there’s no one way to do it – it’s going to depend on the table and the headers. But here’s one idea: Don’t try to make it work.
I’ll explain.
Let’s try to get season team stats from Sports Reference. If you look at that page, you’ll see the problem right away – the headers span two rows, and they repeat. That’s going to be all kinds of no good. You can’t import that. Dataframes must have names all in one row. If you have two-line headers, you have a problem you have to fix before you can do anything else with it.
First we’ll grab the page.
<- "https://www.sports-reference.com/cbb/seasons/2020-school-stats.html" url
Now, similar to our example above, we’ll read the html, use XPath to find the table, and then read that table with a directive passed to it setting the header to FALSE. That tells rvest that there isn’t a header row. Just import it as data.
<- url |>
stats read_html() |>
html_nodes(xpath = '//*[@id="basic_school_stats"]') |>
html_table(header=FALSE)
What we get back is a list of one element (similar to above). So let’s pop it out into a data frame.
<- stats[[1]] stats
And we’ll take a look at what we have.
head(stats)
# A tibble: 6 × 38
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <lgl> <chr> <chr> <lgl> <chr>
1 "" "" Over… Over… Over… Over… Over… Over… NA Conf. Conf. NA Home
2 "Rk" "Scho… G W L W-L% SRS SOS NA W L NA W
3 "1" "Abil… 31 20 11 .645 -2.87 -6.87 NA 15 5 NA 13
4 "2" "Air … 32 12 20 .375 -0.37 3.02 NA 5 13 NA 8
5 "3" "Akro… 31 24 7 .774 7.15 -0.40 NA 14 4 NA 15
6 "4" "Alab… 31 16 15 .516 11.12 8.12 NA 8 10 NA 10
# ℹ 25 more variables: X14 <chr>, X15 <lgl>, X16 <chr>, X17 <chr>, X18 <lgl>,
# X19 <chr>, X20 <chr>, X21 <lgl>, X22 <chr>, X23 <chr>, X24 <chr>,
# X25 <chr>, X26 <chr>, X27 <chr>, X28 <chr>, X29 <chr>, X30 <chr>,
# X31 <chr>, X32 <chr>, X33 <chr>, X34 <chr>, X35 <chr>, X36 <chr>,
# X37 <chr>, X38 <chr>
So, that’s not ideal. We have headers and data mixed together, and our columns are named X1 to X38. Also note: They’re all character fields. Because the headers are interspersed with data, it all gets called character data. So we’ve got to first rename each field.
<- stats |> rename(Rank=X1, School=X2, Games=X3, OverallWins=X4, OverallLosses=X5, WinPct=X6, OverallSRS=X7, OverallSOS=X8, Blank1=X9, ConferenceWins=X10, ConferenceLosses=X11, Blank2=X12, HomeWins=X13, HomeLosses=X14, Blank3=X15, AwayWins=X16, AwayLosses=X17, Blank4=X18, ForPoints=X19, OppPoints=X20, Blank5=X21, Minutes=X22, FieldGoalsMade=X23, FieldGoalsAttempted=X24, FieldGoalPCT=X25, ThreePointMade=X26, ThreePointAttempts=X27, ThreePointPct=X28, FreeThrowsMade=X29, FreeThrowsAttempted=X30, FreeThrowPCT=X31, OffensiveRebounds=X32, TotalRebounds=X33, Assists=X34, Steals=X35, Blocks=X36, Turnovers=X37, PersonalFouls=X38) stats
Now we have to get rid of those headers interspersed in the data. We can do that with filter that say keep all the stuff that isn’t this.
<- stats |> filter(Rank != "Rk" & Games != "Overall") stats
And finally, we need to change the file type of all the fields that need it. We’re going to use a clever little trick, which goes like this: We’re going to use mutate_at
, which means mutate these fields. The pattern for mutate_at
is mutate_at
these variables and do this thing to them. But instead of specifying which of 38 variables we’re going to mutate, we’re going to specify the one we don’t want to change, which is the name of the school. And we just want to convert them to numeric, which is simple. Here’s what it looks like:
<- stats |> mutate_at(vars(-School), as.numeric) stats
One last thing: Who needs columns called Blank1, Blank2, Blank3, etc?
<- stats |> select(-starts_with("Blank")) stats
And just like that, we have a method for getting up to the minute season stats for every team in Division I.
head(stats)
# A tibble: 6 × 33
Rank School Games OverallWins OverallLosses WinPct OverallSRS OverallSOS
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 Abilene Ch… 31 20 11 0.645 -2.87 -6.87
2 2 Air Force 32 12 20 0.375 -0.37 3.02
3 3 Akron 31 24 7 0.774 7.15 -0.4
4 4 Alabama 31 16 15 0.516 11.1 8.12
5 5 Alabama A&M 30 8 22 0.267 -18.9 -8.85
6 6 Alabama St… 32 8 24 0.25 -16.2 -6.53
# ℹ 25 more variables: ConferenceWins <dbl>, ConferenceLosses <dbl>,
# HomeWins <dbl>, HomeLosses <dbl>, AwayWins <dbl>, AwayLosses <dbl>,
# ForPoints <dbl>, OppPoints <dbl>, Minutes <dbl>, FieldGoalsMade <dbl>,
# FieldGoalsAttempted <dbl>, FieldGoalPCT <dbl>, ThreePointMade <dbl>,
# ThreePointAttempts <dbl>, ThreePointPct <dbl>, FreeThrowsMade <dbl>,
# FreeThrowsAttempted <dbl>, FreeThrowPCT <dbl>, OffensiveRebounds <dbl>,
# TotalRebounds <dbl>, Assists <dbl>, Steals <dbl>, Blocks <dbl>, …