Scrape Home Run Leaderboard from Baseball Savant using R


I am trying to scrape the table from the following webpage using R:

https://baseballsavant.mlb.com/leaderboard/home-runs?player_type=Batter&team=&min=0&cat=adj_xhr&year=2020

When inspecting the webpage, this is what I see in the HTML code:

[Screenshot of the inspected HTML; image not available]

SelectorGadget just calls it "table". However, I am having trouble extracting the table into RStudio. This is what I am trying now:

install.packages("rvest")
install.packages("dplyr")

library(rvest)
library(dplyr)
url <- "https://baseballsavant.mlb.com/leaderboard/home-runs?player_type=Batter&team=&min=0&cat=adj_xhr&year=2020"

# Read the webpage
webpage <- read_html(url)

# Extract the table
table <- webpage %>%
  html_node(xpath = '//*[@id="homeruns"]/table') %>%
  html_table()

# View the table
print(table)

Any help would be greatly appreciated!


Solution

  • The webpage uses JavaScript to render the table, so the data is not present in the table element of the HTML that read_html() downloads. Luckily, the data is instead stored as JSON in a "script" element.

    I looked at the vector of returned scripts and visually identified the fourth script as the one containing the data of interest. Because the script is selected by position, this may break at a later time or on a different page.

    library(rvest)
    library(dplyr)
    url <- "https://baseballsavant.mlb.com/leaderboard/home-runs?player_type=Batter&team=&min=0&cat=adj_xhr&year=2020"
    
    # Read the webpage
    webpage <- read_html(url)
    
    # Extract all <script> elements
    scripts <- webpage %>%
       html_elements('script')
    
    # The data is stored in the fourth script (identified by eye)
    myscript <- scripts[4] %>% html_text() %>% trimws()
    
    # Multiple data structures are stored in the script; split them at the ";"
    # (this assumes there is no ";" inside the JSON itself)
    datafields <- strsplit(myscript, ";")
    # Convert the list to a vector
    data <- datafields[[1]]
    # Peek at the start of each piece
    substr(data, 1, 30)
    
    # Remove everything up to and including the "= " of each assignment,
    # leaving only the value (the JSON)
    data <- sub(".+= (.*)", "\\1", data)
    
    # Convert the first piece from JSON
    answer <- jsonlite::fromJSON(data[1])
    
    
    dim(answer)
    #[1] 476  51
    

    The final data frame has 476 rows and 51 columns of information.
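
Since picking the script by position is fragile, one possible variation is to pick it by content instead. The sketch below runs against a small inline HTML string rather than the real page, so the variable name `data` and the fields `player` and `hr` are assumptions made up for illustration. On the real page, the output of the `substr(data, 1, 30)` step shows what the assignment actually looks like, and the search string would need to be adjusted to match.

```r
library(rvest)
library(jsonlite)

# Hypothetical stand-in for the downloaded page. The variable name `data`
# and the fields below are assumptions, not taken from Baseball Savant.
html <- '<html><body>
<script>var unrelated = 1;</script>
<script>
var data = [{"player":"A","hr":10},{"player":"B","hr":7}];
var year = 2020;
</script>
</body></html>'

page <- read_html(html)
script_text <- page %>% html_elements("script") %>% html_text()

# Select the script by what it contains rather than by its position
idx <- grep("var data = [", script_text, fixed = TRUE)
stopifnot(length(idx) == 1)

# Pull out the JSON array of objects: from the first "[{" to the last "}]"
json <- regmatches(script_text[idx],
                   regexpr("\\[\\{.*\\}\\]", script_text[idx]))

# Convert the JSON text to a data frame
answer <- fromJSON(json)
dim(answer)
```

The `stopifnot()` call makes the script fail loudly if the page layout changes and the search string matches no script, or more than one, instead of silently parsing the wrong element.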