html, r, web-scraping, rvest

Why does my attempted rvest web scrape yield too much or too little data?


I am trying to scrape a single table from the following URL: https://baseballsavant.mlb.com/league?season=2023#statcastHitting. However, my attempts either scrape multiple tables from the broader page or return a 0x0 tibble. I simply want to scrape the "Statcast Hitting" table near the top of the page, sans spanner titles. I used Selector Gadget to try to pinpoint the correct nodes, but I suspect I am not referencing the HTML properly. An image of what I'm trying to scrape and convert to a data frame is shown below.

[Screenshot: the "Statcast Hitting" table]

I have tried several times; two of my failed attempts are below.

  1. First attempt successfully pulls the entire page, but this obviously overshoots the mark:

    library(rvest)
    library(tidyverse)
    url <- 'https://baseballsavant.mlb.com/league?season=2023#statcastHitting'

    savant_teams <- url %>%
      read_html() %>%
      html_node('#statcastHitting') %>%
      html_table()

    savant_teams
  2. Second attempt (once again using a selector pasted from Selector Gadget) yields a 0x0 tibble, and the selector passed to html_node is far longer than I was expecting:

    library(tidyverse)
    library(rvest)
    url <- 'https://baseballsavant.mlb.com/league?season=2023#statcastHitting'

    savant_teams <- url %>%
      read_html() %>%
      html_node('#statcast_th-8 .tablesorter-header-inner , #statcast_th-7 , #statcast_th-6 .tablesorter-header-inner , #statcast_th-5 .tablesorter-header-inner , #statcast_th-10 .tablesorter-header-inner , #statcast_th-4 .tablesorter-header-inner , #statcast_th-2 .tablesorter-header-inner , #statcast_th-9 , #statcast_th-1 .tablesorter-header-inner , #statcast_th-3 .tablesorter-header-inner , .tablesorterb23e763259572 #statcast_th-0 .tablesorter-header-inner , #scg_ span') %>%
      html_table()

    savant_teams

Solution

  • Each table on the page is wrapped in a div with class table-savant. Because the table you want is the first one, and html_node() returns only the first match, the selector #statcastHitting div.table-savant picks out that first div, so html_table() parses only the table inside it:

    library(rvest)
    library(tidyverse)
    url <- 'https://baseballsavant.mlb.com/league?season=2023#statcastHitting'
    
    savant_teams <- url %>% 
      read_html() %>% 
      html_node('#statcastHitting div.table-savant') %>%
      html_table()
    
    savant_teams
    #> # A tibble: 32 × 27
    #>    ``     ``     `Standard Stats` `Standard Stats` `Standard Stats`
    #>    <chr>  <chr>  <chr>            <chr>            <chr>           
    #>  1 "Team" Season PA               AB               H               
    #>  2 ""     2023   6,249            5,597            1,543           
    #>  3 ""     2023   5,985            5,428            1,325           
    #>  4 ""     2023   6,207            5,541            1,417           
    #>  5 ""     2023   5,980            5,501            1,308           
    #>  6 ""     2023   6,253            5,567            1,441           
    #>  7 ""     2023   6,219            5,489            1,336           
    #>  8 ""     2023   5,966            5,311            1,187           
    #>  9 ""     2023   6,164            5,511            1,432           
    #> 10 ""     2023   6,180            5,401            1,316           
    #> # ℹ 22 more rows
    #> # ℹ 22 more variables: `Standard Stats` <chr>, `Standard Stats` <chr>,
    #> #   `Standard Stats` <chr>, `Standard Stats` <chr>, `Standard Stats` <chr>,
    #> #   `Standard Stats` <chr>, `Standard Stats` <chr>, `Standard Stats` <chr>,
    #> #   `Standard Stats` <chr>, `Standard Stats` <chr>, Statcast <chr>,
    #> #   Statcast <chr>, Statcast <chr>, Statcast <chr>, Statcast <chr>,
    #> #   Statcast <chr>, Statcast <chr>, Statcast <chr>, Statcast <chr>, …
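
The output above still uses the spanner titles ("Standard Stats", "Statcast") as column names, with the real headers ("Team", "Season", "PA", ...) sitting in the first row. Since the question asks for the table sans spanner titles, here is a minimal sketch of one way to promote the real header row. It is not run against the live site: the HTML below is a hypothetical stand-in that only assumes the same two-row header layout, and the names page, raw, real_names and clean are made up for illustration.

```r
library(rvest)

# Hypothetical stand-in for the real page (an assumption, not the site's markup):
# a spanner header row, then the real header row, then the data rows.
page <- minimal_html('
  <div id="statcastHitting">
    <div class="table-savant">
      <table>
        <thead>
          <tr><th></th><th colspan="2">Standard Stats</th></tr>
          <tr><th>Season</th><th>PA</th><th>AB</th></tr>
        </thead>
        <tbody>
          <tr><td>2023</td><td>6,249</td><td>5,597</td></tr>
          <tr><td>2023</td><td>5,985</td><td>5,428</td></tr>
        </tbody>
      </table>
    </div>
  </div>')

# header = FALSE keeps both header rows as ordinary rows (columns X1, X2, ...),
# which avoids working with duplicated spanner names.
raw <- page %>%
  html_node("#statcastHitting div.table-savant") %>%
  html_table(header = FALSE)

# Row 1 holds the spanner titles, row 2 the real column names.
real_names <- unlist(raw[2, ], use.names = FALSE)
clean <- raw[-(1:2), ]
names(clean) <- real_names

# Strip thousands separators and let R guess each column's type.
# Note: this also removes commas from any text column.
clean[] <- lapply(clean, function(col) {
  utils::type.convert(gsub(",", "", col, fixed = TRUE), as.is = TRUE)
})

clean
```

On the real page, swapping minimal_html(...) for read_html(url) should give the same shape of result, provided the second header row really contains the column names. If some of those names turn out to be blank or duplicated, they may need repairing before use with dplyr verbs.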