Tags: r, web-scraping, rvest

Scraping tables from Sports Reference with rvest


I'm trying to scrape the various tables from this webpage: https://www.pro-football-reference.com/years/2020/

When inspecting the elements of the page, I found it easy to obtain the first two tables by using the following code:

### packages
library(tidyverse)
library(rvest)

### Read the page (url_off holds the parsed HTML document, not the URL string)
url_off <- read_html("https://www.pro-football-reference.com/years/2020/")


## AFC Standings
url_off %>% 
  html_table(fill = TRUE) %>% 
  .[1] %>% 
  as.data.frame()

## NFC Standings
url_off %>% 
  html_table(fill = TRUE) %>% 
  .[2] %>% 
  as.data.frame()

Where I am stuck is every other table on that page.

For example, I can see where the offense table is when I inspect the page.

I've tried a few ways of extracting it without any luck. For example:

url_off %>%
  html_nodes(".table_outer_container") %>%
  html_nodes("#team_stats")

url_off %>%
  html_nodes(".table_wrapper") %>%
  html_nodes("#team_stats")

The same thing happens when I try to extract any of the other tables from that page: the only two I can get are the first two (above). I can't figure out where I am going wrong.


Solution

  • I've sorted it out. The remaining tables are stored inside HTML comments, so they are not part of the parsed document and the selectors above return nothing. Here is how I extracted them, for anyone having similar issues:

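    url_off %>%
      html_nodes('#all_team_stats') %>%      # wrapper div around the offense table
      html_nodes(xpath = 'comment()') %>%    # the comment node that holds the table markup
      html_text() %>%                        # comment contents as a plain string
      read_html() %>%                        # re-parse that string as HTML
      html_node('table') %>%
      html_table()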
    
    
    # Same pattern for the passing table
    url_off %>%
      html_nodes('#all_passing') %>%
      html_nodes(xpath = 'comment()') %>%
      html_text() %>%
      read_html() %>%
      html_node('table') %>%
      html_table()
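
The two snippets above differ only in the wrapper id, so the pattern could be folded into a small helper. This is only a minimal sketch: `read_commented_table` is a made-up name (it is not part of rvest), and it assumes the wrapper element contains a comment that holds a `<table>`. The demo HTML and its numbers are invented stand-ins for the real page, so the example runs without hitting the site.

```r
library(rvest)

# Hypothetical helper (not an rvest function): return the first table found
# inside an HTML comment under the element with id `wrapper_id`.
read_commented_table <- function(page, wrapper_id) {
  comments <- page %>%
    html_nodes(paste0("#", wrapper_id)) %>%
    html_nodes(xpath = ".//comment()") %>%
    html_text()

  # Keep only comments that actually contain table markup
  comments <- comments[grepl("<table", comments, fixed = TRUE)]
  if (length(comments) == 0) {
    stop("No commented-out table found under #", wrapper_id)
  }

  # Re-parse the comment text as HTML and convert the table
  read_html(comments[1]) %>%
    html_node("table") %>%
    html_table()
}

# Invented demo page that mimics the assumed structure: the table sits
# inside a comment, so it is not an element of the parsed document.
demo_page <- read_html("
  <html><body>
    <div id='all_team_stats'>
      <!--
      <table id='team_stats'>
        <tr><th>Tm</th><th>PF</th></tr>
        <tr><td>Team A</td><td>100</td></tr>
        <tr><td>Team B</td><td>90</td></tr>
      </table>
      -->
    </div>
  </body></html>
")

# A direct selector finds nothing, which matches the problem in the question
direct_hits <- html_nodes(demo_page, "#team_stats")

# The helper recovers the table from the comment
demo_table <- read_commented_table(demo_page, "all_team_stats")
```

With the real page this would be called as `read_commented_table(url_off, "all_team_stats")` or `read_commented_table(url_off, "all_passing")`, assuming the site still wraps those tables in comments.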