r, dplyr, rvest

Rvest Pulls Empty Tables


The site I scrape data from has changed, and I'm having trouble pulling the data into table format. I tried the two approaches below to get the tables, but both return blanks instead of tables.

I'm a novice when it comes to scraping and would appreciate the expertise of the group. Should I look for other solutions in rvest, or try to learn a tool like RSelenium?

https://www.pgatour.com/stats/detail/02675

Scrape for Multiple Links

library("dplyr")
library("purrr")
library("tidyr")  # needed for unnest()
library("rvest")

df23 <- expand.grid(
  stat_id = c("02568","02674", "02567", "02564", "101")  
) %>% 
  mutate(
    links = paste0(
      'https://www.pgatour.com/stats/detail/',
      stat_id
    )
  ) %>% 
  as_tibble()

#replaced tournament_id with stat_id
get_info <- function(link, stat_id){
  data <- link %>%
    read_html() %>%
    html_table() %>%
    .[[2]] 
}

test_main_stats <- df23 %>%
  mutate(tables = map2(links, stat_id, possibly(get_info, otherwise = tibble())))

test_main_stats <- test_main_stats %>% 
  unnest(everything())

Alternative Code

url <- read_html("https://www.pgatour.com/stats/detail/02568")
test1 <- url %>%
  html_nodes(".css-8atqhb") %>%
  html_table

Solution

  • This page uses JavaScript to create the table, so rvest alone will not work: read_html() only sees the initial HTML, before any scripts have run. But if you examine the page's source code, all of the data is stored in JSON format in a <script> node.

    This code finds that node and converts the JSON to a list. The variable answer holds the main table, but there is a wealth of other information in the JSON data structure.

    #read page
    library(rvest)
    page <- read_html("https://www.pgatour.com/stats/detail/02675")
    
    #find the script with the correct id tag, strip the html code
    datascript <- page %>% html_elements(xpath = ".//script[@id='__NEXT_DATA__']") %>% html_text()
    
    #convert from JSON 
    output <- jsonlite::fromJSON(datascript)
    #explore the output
    str(output)
    
    #get the main table 
    answer <- output$props$pageProps$statDetails$rows
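
Since the question loops over several stat IDs, the same idea can be wrapped in a function and mapped over them. This is only a sketch: the helper names extract_stat_rows, get_stat and scrape_stats are made up for illustration, and it assumes every stat page stores its table under the same props$pageProps$statDetails$rows path as the page above, which may not hold if the site changes again.

```r
library(rvest)
library(purrr)

# Hypothetical helper: given a parsed page, read the JSON embedded in the
# <script id="__NEXT_DATA__"> node and return the stat rows.
# Assumption: the rows live at props$pageProps$statDetails$rows.
extract_stat_rows <- function(page) {
  json_text <- page %>%
    html_element(xpath = ".//script[@id='__NEXT_DATA__']") %>%
    html_text()
  parsed <- jsonlite::fromJSON(json_text)
  parsed$props$pageProps$statDetails$rows
}

# Hypothetical wrapper: build the URL for one stat id, download and extract.
get_stat <- function(stat_id) {
  url <- paste0("https://www.pgatour.com/stats/detail/", stat_id)
  extract_stat_rows(read_html(url))
}

# possibly() returns NULL instead of stopping when a single page fails.
safe_get_stat <- possibly(get_stat, otherwise = NULL)

# Scrape several stat ids; returns a list named by stat id,
# with failed pages dropped by compact().
scrape_stats <- function(stat_ids) {
  compact(map(set_names(stat_ids), safe_get_stat))
}

# Usage (makes network requests, so it is left commented out):
# all_stats <- scrape_stats(c("02568", "02674", "02567", "02564", "101"))
# str(all_stats, max.level = 1)
```

Keeping the parsing step (extract_stat_rows) separate from the download step makes it possible to check the JSON path against a saved copy of the page without hitting the site repeatedly.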