Why are my attempts at an rvest webscrape failing (Error in UseMethod)?


I have made many failed attempts to scrape a page from a website that I have scraped successfully in other use cases. In this particular case I can't seem to get anything but this error:

"Error in UseMethod("html_table"): no applicable method for 'html_table' applied to an object of class 'xml_missing'."

More generally, when web scraping in R I have had difficulty finding the right CSS selectors (or the right sequence of them), and tools like SelectorGadget have been of little help.

Below are several code chunks I've tried. I'd be grateful for any proposed solutions. As a bonus, any recommended resources on web scraping in R would be appreciated.

library(tidyverse)
library(rvest)
library(xml2)
  
url <- 'https://baseballsavant.mlb.com/leaderboard/percentile-rankings?type=batter&team='

# Attempt 1: CSS selector
hitting <- url %>%
  read_html() %>%
  html_node('#prLeaderboard div.table-savant') %>%
  html_table()

# Attempt 2: XPath
hitting <- url %>%
  read_html() %>%
  html_element(xpath = "//div[@id='statcastHitting']/div[@class='table-savant']") %>%
  html_table()

# Attempt 3: first <table> element on the page
hitting <- url %>%
  read_html() %>%
  html_node("table") %>%
  html_table()


Solution

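  • The problem is that this table is not sent as an HTML table; its data arrive inside a <script> tag. Your selectors therefore find no matching node in the static HTML, html_node() returns an object of class xml_missing, and html_table() has no method for that class, hence the error. You can see this by inspecting either the output of read_html or the page source itself: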

    [Screenshot: raw HTML source]
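
    The same check can be sketched offline. The snippet below uses a small stand-in page built with rvest::minimal_html() rather than the live site, so the markup and the single data row are made up for illustration; only the selector comes from the question. It shows what a failed match looks like before html_table() is ever called.

    ```r
    library(rvest)

    # Hypothetical stand-in for the downloaded page: the container div is empty
    # and the data sit in a <script> tag, as described above.
    page <- minimal_html('
      <div id="prLeaderboard"></div>
      <script>var leaderboard_data = [{"player_name": "Player A", "xwoba": 65}];</script>
    ')

    # The selector from the question matches nothing in the static HTML
    node <- html_element(page, "#prLeaderboard div.table-savant")
    class(node)  # "xml_missing"

    # No <table> elements exist either, but the <script> is there
    n_tables  <- length(html_elements(page, "table"))
    n_scripts <- length(html_elements(page, "script"))
    ```

    Checking class(node), or counting the matches with html_elements(), before piping into html_table() makes it obvious when a selector has matched nothing.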

    In your browser this script is executed and populates the table, but rvest does not run JavaScript. It is possible, however, to evaluate the JavaScript and extract that variable:

    url <- 'https://baseballsavant.mlb.com/leaderboard/percentile-rankings?type=batter&team='
    
    scripts <- rvest::read_html(url) |>
      rvest::html_nodes(xpath=".//script")
    
    ## We're looking for the 'leaderboard_data' JavaScript var
    varname <- "leaderboard_data"
    
    ## It's the 4th script on the page, not otherwise labelled
    var <-  rvest::html_text(scripts[[4]]) |>
      stringi::stri_split_lines() |>
      purrr::flatten_chr() |>
      purrr::keep(stringi::stri_detect_regex, varname)
    
    ## Fire up the JavaScript parser
    jsx <- V8::v8()
    jsx$eval(var)
    jsx$get(varname)
    
    ##              player_name team_name team_id year player_type   ...
    ## 1          Austin Hedges   Rangers     140 2023      batter 
    ## 2         Zach Remillard White Sox     145 2023      batter 
    ## 3          Alec Burleson Cardinals     138 2023      batter 
    ## 4            Connor Wong   Red Sox     111 2023      batter 
    ## ...
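
    Relying on the script being the 4th one on the page is fragile if the site's layout changes. As a rough sketch of an alternative, the script can be picked by its content, and the value parsed with jsonlite instead of V8. This swaps in a different technique from the one above and assumes that the assignment sits on a single line and that the assigned value is valid JSON. The page below is a made-up stand-in, not the real site.

    ```r
    library(rvest)

    # Hypothetical stand-in page with two scripts; the data are invented
    page <- minimal_html('
      <script>var other = 1;</script>
      <script>
        var leaderboard_data = [{"player_name":"Player A","xwoba":65},{"player_name":"Player B","xwoba":4}];
      </script>
    ')

    varname <- "leaderboard_data"

    # Pick the first script whose text mentions the variable, not a fixed index
    script_text <- html_text(html_elements(page, "script"))
    hit <- script_text[grepl(varname, script_text, fixed = TRUE)][1]

    # Keep the line with the assignment (assumed to be on one line)
    lines <- strsplit(hit, "\n", fixed = TRUE)[[1]]
    line  <- grep(varname, lines, value = TRUE, fixed = TRUE)[1]

    # Drop everything up to "=" and the trailing semicolon, leaving the value
    json <- sub(paste0("^.*", varname, "\\s*=\\s*"), "", line)
    json <- sub(";\\s*$", "", json)

    # Parse the value (assumed to be valid JSON) into a data frame
    leaderboard <- jsonlite::fromJSON(json)
    ```

    If the value is a JavaScript expression rather than plain JSON, this will fail and evaluating it with V8 as above is the safer route.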
    

    There is, however, a much easier way to get these data: the website provides a direct CSV download. Just append &csv=true to the URL; no other tools are needed:

    read.csv(paste0(url, "&csv=true"))
    
    ##              player_name player_id year xwoba xba xslg    ...
    ## 1          Austin Hedges    595978 2023    NA  NA   NA
    ## 2         Zach Remillard    621545 2023    NA  NA   NA
    ## 3          Alec Burleson    676475 2023    65  85   68
    ## 4            Connor Wong    657136 2023     4   4   14
    ## ...