I have made many failed attempts to scrape a page from a website that I have successfully scraped in other use cases. In this particular case I can't seem to get anything but the error:
"Error in UseMethod("html_table"): no applicable method for 'html_table' applied to an object of class 'xml_missing'."
In general with web scraping in R, I've been having difficulty finding the right CSS selectors (or their sequencing), and tools like SelectorGadget have been of little help.
See below for several code chunks I've tried. I'd be grateful for any proposed solutions. As a bonus, any recommended resources on R web scraping in general are appreciated.
library(tidyverse)
library(rvest)
library(xml2)
url <- 'https://baseballsavant.mlb.com/leaderboard/percentile-rankings?type=batter&team='
hitting <- url %>%
  read_html() %>%
  html_node('#prLeaderboard div.table-savant') %>%
  html_table()
url <- 'https://baseballsavant.mlb.com/leaderboard/percentile-rankings?type=batter&team='
hitting <- url %>%
  read_html() %>%
  html_element(xpath = "//div[@id='statcastHitting']/div[@class='table-savant']") %>%
  html_table()
url <- 'https://baseballsavant.mlb.com/leaderboard/percentile-rankings?type=batter&team='
hitting <- url %>%
  read_html() %>%
  html_node("table") %>%
  html_table()
The problem is that this table is not sent as an HTML table, but as data inside a <script> tag. You can see this by inspecting either the output of read_html or the page source itself.
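As a rough illustration of that inspection, here is a minimal sketch that uses a small inline HTML string as a hypothetical stand-in for the real page. The markup and the two sample records are assumptions made up for the example, not the site's actual source; only the variable name leaderboard_data comes from the page. It shows why html_table() fails (the selector matches nothing, so you get an xml_missing object) and how to find which script holds the data.

```r
library(rvest)

# Hypothetical stand-in for the served page: an empty container plus a
# <script> holding the data. The real page is larger (assumption: same shape).
page <- read_html('
<html><body>
  <div id="prLeaderboard"></div>
  <script>var leaderboard_data = [{"player_name":"A","xwoba":65},{"player_name":"B","xwoba":4}];</script>
</body></html>')

# There is no <table> in the served HTML, so the selector finds nothing
tbl <- html_element(page, "table")
class(tbl)

# The data sits in a <script> tag instead: check each script's text
scripts <- html_elements(page, "script")
has_data <- grepl("leaderboard_data", html_text(scripts), fixed = TRUE)
which(has_data)
```

Passing an xml_missing object to html_table() is what produces the "no applicable method" error from the question.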
In your browser this script gets executed and populates the table, but rvest does no such thing. It is possible to evaluate the JavaScript and extract that variable, though:
url <- 'https://baseballsavant.mlb.com/leaderboard/percentile-rankings?type=batter&team='
scripts <- rvest::read_html(url) |>
  rvest::html_nodes(xpath=".//script")
## We're looking for the 'leaderboard_data' JavaScript var
varname <- "leaderboard_data"
## It's the 4th script on the page, not otherwise labelled
var <- rvest::html_text(scripts[[4]]) |>
  stringi::stri_split_lines() |>
  purrr::flatten_chr() |>
  purrr::keep(stringi::stri_detect_regex, varname)
## Fire up the JavaScript parser
jsx <- V8::v8()
jsx$eval(var)
jsx$get(varname)
## player_name team_name team_id year player_type ...
## 1 Austin Hedges Rangers 140 2023 batter
## 2 Zach Remillard White Sox 145 2023 batter
## 3 Alec Burleson Cardinals 138 2023 batter
## 4 Connor Wong Red Sox 111 2023 batter
## ...
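Relying on the script being the 4th on the page is fragile if the site's layout changes. A possible variation, sketched below, picks the script by its content and parses the value with jsonlite instead of V8. This is a different technique from the one above, and it rests on two assumptions: the variable is assigned on a single line, and the value is strict JSON (a plain JavaScript object literal with unquoted keys would not parse). The inline HTML is again a made-up stand-in for the real page.

```r
library(rvest)
library(jsonlite)

# Hypothetical stand-in for the real page (sample records are invented)
page <- read_html('
<html><body>
  <script>var other = 1;</script>
  <script>
    var leaderboard_data = [{"player_name":"A","xwoba":65},{"player_name":"B","xwoba":4}];
  </script>
</body></html>')

varname <- "leaderboard_data"

# Select the script by content rather than by position
txt <- html_text(html_elements(page, "script"))
src <- txt[grepl(varname, txt, fixed = TRUE)][1]

# Keep the line with the assignment (assumes it fits on one line)
lines <- strsplit(src, "\r?\n")[[1]]
line <- lines[grepl(varname, lines, fixed = TRUE)][1]

# Strip the "var leaderboard_data = " prefix and the trailing semicolon
json <- sub(paste0("^.*", varname, "\\s*=\\s*"), "", line)
json <- sub(";\\s*$", "", json)

# Parse as JSON (assumes the value is valid JSON)
df <- fromJSON(json)
df
```

If the value turns out not to be valid JSON, evaluating it with V8 as shown above remains the safer route.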
There is, however, a much easier way to get these data: the website provides a direct CSV download. Just add &csv=true to the URL, no other tools needed:
read.csv(paste0(url, "&csv=true"))
## player_name player_id year xwoba xba xslg ...
## 1 Austin Hedges 595978 2023 NA NA NA
## 2 Zach Remillard 621545 2023 NA NA NA
## 3 Alec Burleson 676475 2023 65 85 68
## 4 Connor Wong 657136 2023 4 4 14
## ...