Search code examples
rweb-scrapingrvest

Issue Scraping "Cleaning The Glass" Table


I am trying to write code that allows me to scrape the shooting accuracy table from the basketball website cleaningtheglass.com. I've tried finding the CSS selector to extract the table, but I must have been doing something wrong because I kept on getting nothing.

Here is my code:

library(rvest)
library(tidyverse)

url <- "https://cleaningtheglass.com/stats/players?stat_category=shooting_overall#/"

# Read the HTML content of the webpage
webpage <- url %>%
  read_html()
  

# Use the specific CSS selector for the table
table_data <- page %>%
  html_nodes('#shooting_overall > div.stat_table_container') 

What am I doing wrong?


Solution

  • This is a somewhat interesting case. The reason why your selectors are not working are already stated in comments -- it's a JavaScript-driven site and what you see in CSS Selector or in inspector of browser's dev tools is quite different from the actual page source that is available for rvest. Though that table data is embedded in the site's source and delivered in a single response, that itself isn't too uncommon. But the size of that dataset is ~13MB and it's all squeezed into a single <script> .. </script> element; apparently rvest::html_text() is not able to extract all of it and returns a truncated string.

    So instead of rvest we can load the page content with httr(2) and process lines of text; after locating relevant js function call, we can extract function arguments (JavaScript arrays and objects), each of those conveniently on its own line. Once those strings are clean enough (e.g. no trailing commas, extra whitespace is fine), we can parse those objects as JSON strings.

    library(httr2)
    library(readr)
    library(dplyr)
    library(stringr)
    
    url_ <- "https://cleaningtheglass.com/stats/players?stat_category=shooting_overall#/"
    
    # read html as as lines
    html_l <- 
      request(url_) |>
      req_perform() |>
      resp_body_string() |>
      read_lines()
    
    # locate target js assignment and get get relevant vuePlayers() function argument values
    idx_anchor <- which(html_l == "        window.vuePlayerFilter = vuePlayers(")
    players <- html_l[(idx_anchor+1):(idx_anchor+4)]
    
    # names from js function arguments
    names(players) <- c("allPlayerData", "onOffTeamData", "onOffOpponentData", "statCategoryMappings")
    
    # check start & end for anything that might cause issues for jsonlite
    tibble(arg   = names(players), 
           start = sapply(players, str_trunc, 20, side ="right"),
           end   = sapply(players, str_trunc, 20, side ="left"))
    #> # A tibble: 4 × 3
    #>   arg                  start                   end                      
    #>   <chr>                <chr>                   <chr>                    
    #> 1 allPlayerData        "            [{\"th..." "...attempts\": 108}],"  
    #> 2 onOffTeamData        "            [{\"cn..." "...hort_fg\": null}],"  
    #> 3 onOffOpponentData    "            [{\"pt..." "...name\": \"Reath\"}],"
    #> 4 statCategoryMappings "            [[\"of..." "...ve Rebounds\"}]]],"
    # remove trailing commas 
    players <- sapply(players, \(x) gsub(",$", "", x))
    
    # parse all function arguments as JSONs
    players <- lapply(players, jsonlite::fromJSON)
    
    # looks like we have a view config for default table 
    players$statCategoryMappings[[2]] |> str()
    #> List of 2
    #>  $ : chr "shooting_overall"
    #>  $ :'data.frame':    8 obs. of  4 variables:
    #>   ..$ abbr: chr [1:8] "efg_perc" "fg2_perc" "fg3_perc" "ft_perc" ...
    #>   ..$ type: chr [1:8] "percent1" "percent1" "percent1" "percent1" ...
    #>   ..$ name: chr [1:8] "eFG%" "2P%" "3P%" "FT%" ...
    #>   ..$ sort: int [1:8] NA NA NA NA 0 0 0 0
    
    # named column name vector for select
    stat_map <- 
      players$statCategoryMappings[[2]][[2]][,c("name", "abbr")] |> 
      mutate(name = gsub("<br />", " ", name, fixed = TRUE)) |>
      tibble::deframe()
    stat_map
    #>              eFG%               2P%               3P%               FT% 
    #>        "efg_perc"        "fg2_perc"        "fg3_perc"         "ft_perc" 
    #>         ASTD% All         ASTD% Rim         ASTD% Mid       ASTD% Three 
    #>       "astd_perc"   "astd_rim_perc"   "astd_nr2_perc" "astd_three_perc"
    
    # and allPlayerData, subset that matches with the site's default table
    # (total number of columns is 111)
    players$allPlayerData %>% 
      as_tibble() %>% 
      select(name, age, team = team_name, pos = pos_category, 
             sec_played = seconds_played, all_of(stat_map)) 
    

    Result:

    #> # A tibble: 306 × 13
    #>    name         age team  pos   sec_played `eFG%` `2P%`  `3P%` `FT%` `ASTD% All`
    #>    <chr>      <dbl> <chr> <chr>      <dbl>  <dbl> <dbl>  <dbl> <dbl>       <dbl>
    #>  1 Precious …  24.1 TOR   big        10311  0.509 0.548  0.267 0.778       0.815
    #>  2 Bam Adeba…  26.3 MIA   big        26814  0.540 0.538  0.5   0.825       0.556
    #>  3 Ochai Agb…  23.5 UTA   wing       16443  0.542 0.5    0.382 0.667       0.846
    #>  4 Santi Ald…  22.8 MEM   big        12095  0.543 0.519  0.372 0.6         0.814
    #>  5 Nickeil A…  25.2 MIN   wing       16807  0.565 0.6    0.364 0.333       0.774
    #>  6 Grayson A…  28.1 PHX   wing       28936  0.627 0.5    0.474 0.864       0.705
    #>  7 Jarrett A…  25.5 CLE   big        16405  0.622 0.622 NA     0.756       0.804
    #>  8 Kyle Ande…  30.1 MIN   forw…      20847  0.579 0.612  0.222 0.581       0.535
    #>  9 Giannis A…  28.9 MIL   big        27676  0.615 0.647  0.222 0.625       0.45 
    #> 10 Cole Anth…  23.5 ORL   point      22301  0.5   0.478  0.351 0.848       0.484
    #> # ℹ 296 more rows
    #> # ℹ 3 more variables: `ASTD% Rim` <dbl>, `ASTD% Mid` <dbl>, `ASTD% Three` <dbl>
    

    Created on 2023-11-25 with reprex v2.0.2