I am trying to write code that allows me to scrape the shooting accuracy table from the basketball website cleaningtheglass.com. I've tried finding the CSS selector to extract the table, but I must have been doing something wrong because I kept on getting nothing.
Here is my code:
library(rvest)
library(tidyverse)
url <- "https://cleaningtheglass.com/stats/players?stat_category=shooting_overall#/"
# Read the HTML content of the webpage
webpage <- url %>%
read_html()
# Use the specific CSS selector for the table
table_data <- page %>%
html_nodes('#shooting_overall > div.stat_table_container')
What am I doing wrong?
This is a somewhat interesting case. The reason why your selectors are not working are already stated in comments -- it's a JavaScript-driven site and what you see in CSS Selector or in inspector of browser's dev tools is quite different from the actual page source that is available for rvest
. Though that table data is embedded in the site's source and delivered in a single response, that itself isn't too uncommon. But the size of that dataset is ~13MB and it's all squeezed into a single <script> .. </script>
element; apparently rvest::html_text()
is not able to extract all of it and returns a truncated string.
So instead of rvest
we can load the page content with httr(2)
and process lines of text; after locating relevant js function call, we can extract function arguments (JavaScript arrays and objects), each of those conveniently on its own line. Once those strings are clean enough (e.g. no trailing commas, extra whitespace is fine), we can parse those objects as JSON strings.
library(httr2)
library(readr)
library(dplyr)
library(stringr)
url_ <- "https://cleaningtheglass.com/stats/players?stat_category=shooting_overall#/"
# read html as as lines
html_l <-
request(url_) |>
req_perform() |>
resp_body_string() |>
read_lines()
# locate target js assignment and get get relevant vuePlayers() function argument values
idx_anchor <- which(html_l == " window.vuePlayerFilter = vuePlayers(")
players <- html_l[(idx_anchor+1):(idx_anchor+4)]
# names from js function arguments
names(players) <- c("allPlayerData", "onOffTeamData", "onOffOpponentData", "statCategoryMappings")
# check start & end for anything that might cause issues for jsonlite
tibble(arg = names(players),
start = sapply(players, str_trunc, 20, side ="right"),
end = sapply(players, str_trunc, 20, side ="left"))
#> # A tibble: 4 × 3
#> arg start end
#> <chr> <chr> <chr>
#> 1 allPlayerData " [{\"th..." "...attempts\": 108}],"
#> 2 onOffTeamData " [{\"cn..." "...hort_fg\": null}],"
#> 3 onOffOpponentData " [{\"pt..." "...name\": \"Reath\"}],"
#> 4 statCategoryMappings " [[\"of..." "...ve Rebounds\"}]]],"
# remove trailing commas
players <- sapply(players, \(x) gsub(",$", "", x))
# parse all function arguments as JSONs
players <- lapply(players, jsonlite::fromJSON)
# looks like we have a view config for default table
players$statCategoryMappings[[2]] |> str()
#> List of 2
#> $ : chr "shooting_overall"
#> $ :'data.frame': 8 obs. of 4 variables:
#> ..$ abbr: chr [1:8] "efg_perc" "fg2_perc" "fg3_perc" "ft_perc" ...
#> ..$ type: chr [1:8] "percent1" "percent1" "percent1" "percent1" ...
#> ..$ name: chr [1:8] "eFG%" "2P%" "3P%" "FT%" ...
#> ..$ sort: int [1:8] NA NA NA NA 0 0 0 0
# named column name vector for select
stat_map <-
players$statCategoryMappings[[2]][[2]][,c("name", "abbr")] |>
mutate(name = gsub("<br />", " ", name, fixed = TRUE)) |>
tibble::deframe()
stat_map
#> eFG% 2P% 3P% FT%
#> "efg_perc" "fg2_perc" "fg3_perc" "ft_perc"
#> ASTD% All ASTD% Rim ASTD% Mid ASTD% Three
#> "astd_perc" "astd_rim_perc" "astd_nr2_perc" "astd_three_perc"
# and allPlayerData, subset that matches with the site's default table
# (total number of columns is 111)
players$allPlayerData %>%
as_tibble() %>%
select(name, age, team = team_name, pos = pos_category,
sec_played = seconds_played, all_of(stat_map))
Result:
#> # A tibble: 306 × 13
#> name age team pos sec_played `eFG%` `2P%` `3P%` `FT%` `ASTD% All`
#> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Precious … 24.1 TOR big 10311 0.509 0.548 0.267 0.778 0.815
#> 2 Bam Adeba… 26.3 MIA big 26814 0.540 0.538 0.5 0.825 0.556
#> 3 Ochai Agb… 23.5 UTA wing 16443 0.542 0.5 0.382 0.667 0.846
#> 4 Santi Ald… 22.8 MEM big 12095 0.543 0.519 0.372 0.6 0.814
#> 5 Nickeil A… 25.2 MIN wing 16807 0.565 0.6 0.364 0.333 0.774
#> 6 Grayson A… 28.1 PHX wing 28936 0.627 0.5 0.474 0.864 0.705
#> 7 Jarrett A… 25.5 CLE big 16405 0.622 0.622 NA 0.756 0.804
#> 8 Kyle Ande… 30.1 MIN forw… 20847 0.579 0.612 0.222 0.581 0.535
#> 9 Giannis A… 28.9 MIL big 27676 0.615 0.647 0.222 0.625 0.45
#> 10 Cole Anth… 23.5 ORL point 22301 0.5 0.478 0.351 0.848 0.484
#> # ℹ 296 more rows
#> # ℹ 3 more variables: `ASTD% Rim` <dbl>, `ASTD% Mid` <dbl>, `ASTD% Three` <dbl>
Created on 2023-11-25 with reprex v2.0.2