I am trying to scrape the table from the following webpage using R:
When inspecting the webpage, this is what I see in the HTML code:
SelectorGadget just calls it "table". However, I am having trouble extracting the table into RStudio. This is what I am trying now:
install.packages("rvest")
install.packages("dplyr")
library(rvest)
library(dplyr)
url <- "https://baseballsavant.mlb.com/leaderboard/home-runs?player_type=Batter&team=&min=0&cat=adj_xhr&year=2020"
# Read the webpage
webpage <- read_html(url)
# Extract the table
table <- webpage %>%
html_node(xpath = '//*[@id="homeruns"]/table') %>%
html_table()
# View the table
print(table)
Any help would be greatly appreciated!
The webpage uses JavaScript to render the page, so the data is not stored in the table element of the HTML that read_html() receives. Luckily, the data is stored as JSON in a "script" element instead.
I looked at the vector of returned scripts and visually identified the fourth script as the one containing the data of interest, so this may break at a later time or on a different page.
library(rvest)
library(dplyr)
url <- "https://baseballsavant.mlb.com/leaderboard/home-runs?player_type=Batter&team=&min=0&cat=adj_xhr&year=2020"
# Read the webpage
webpage <- read_html(url)
# Extract all of the script elements
scripts <- webpage %>%
html_elements('script')
#data is stored in the fourth script (identified by eye)
myscript <- scripts[4] %>% html_text() %>% trimws()
#multiple data structures are stored; split them at the ";"
datafields <- strsplit(myscript, ";")
#convert the list to vector
data <- datafields[[1]]
#peek at the data
substr(data, 1, 30)
#remove everything up to and including the "= ", leaving only the JSON
data <- sub(".+= (.*)", "\\1", data)
#convert from JSON
answer <- jsonlite::fromJSON(data[1])
dim(answer)
#[1] 476 51
The final data frame has 476 rows and 51 columns of information.
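Since picking the script by position may break, one possible alternative is to pick it by content. The following is only a minimal sketch, not tested against the live page: it uses a small made-up HTML snippet as a stand-in, and the variable name `leaderboard_data`, the fields `player` and `hr`, and the pattern "an assignment of a JSON array of objects" are all assumptions for illustration. The real script may be laid out differently.

```r
library(rvest)
library(jsonlite)

# Hypothetical stand-in for the real page: two script elements,
# only the second one holds a JSON array of records.
page <- minimal_html('
  <script>var a = 1;</script>
  <script>var leaderboard_data = [{"player":"A","hr":10},{"player":"B","hr":7}];</script>
')

# Text of every script element
scripts <- page %>% html_elements("script") %>% html_text()

# Assumption: the data script assigns a JSON array of objects, i.e. "= [{"
is_data <- grepl("=\\s*\\[\\{", scripts)
target <- scripts[which(is_data)[1]]

# Keep the text from the first "[{" to the last "}]" (greedy match)
json <- regmatches(target, regexpr("\\[\\{.*\\}\\]", target))

answer <- fromJSON(json)
dim(answer)
```

Because the match is greedy, a script holding several arrays would be captured as one invalid string, so splitting at ";" first, as in the code above, may still be needed on the real page.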