I am trying to scrape the following table using R:
https://www.nba.com/stats/teams/opponent-shooting
The code I have written is as follows:
library(rvest)
url <- "https://www.nba.com/stats/teams/opponent-shooting"
page <- read_html(url)
table_data <- page %>%
  html_table(fill = TRUE)
However, this returns a table that looks like a calendar month rather than the stats table.
Any idea how to successfully webscrape the intended table please?
From the comments, it seems the page might be displayed differently for some. For reference, this is how it renders for me: https://i.sstatic.net/evLZ7.png
Anyway, it's a dynamic JavaScript-driven page, and the table content (JSON) is fetched through an API call, which you can trace through the network tab of your browser's developer tools. To make that request yourself (and succeed), you'd need to alter the request headers a bit; here's one option for how this might look with httr2:
library(httr2)
req_url <- "https://stats.nba.com/stats/leaguedashteamshotlocations?Conference=&DateFrom=&DateTo=&DistanceRange=5ft%20Range&Division=&GameScope=&GameSegment=&ISTRound=&LastNGames=0&Location=&MeasureType=Opponent&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2023-24&SeasonSegment=&SeasonType=Regular%20Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision="
json <-
  request(req_url) |>
  req_user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36") |>
  req_headers(
    Accept = "*/*",
    Origin = "https://www.nba.com",
    Referer = "https://www.nba.com/"
  ) |>
  req_perform() |>
  resp_body_json()
hdr <- json$resultSets$headers
# build column names from 2-level header structure
clean_names <-
  c(
    rep("", hdr[[1]]$columnsToSkip),
    rep(unlist(hdr[[1]]$columnNames), each = 3)
  ) |>
  paste(unlist(hdr[[2]]$columnNames)) |>
  janitor::make_clean_names()
clean_names
#> [1] "team_id" "team_name"
#> [3] "less_than_5_ft_opp_fgm" "less_than_5_ft_opp_fga"
#> [5] "less_than_5_ft_opp_fg_pct" "x5_9_ft_opp_fgm"
#> [7] "x5_9_ft_opp_fga" "x5_9_ft_opp_fg_pct"
#> [9] "x10_14_ft_opp_fgm" "x10_14_ft_opp_fga"
#> [11] "x10_14_ft_opp_fg_pct" "x15_19_ft_opp_fgm"
#> [13] "x15_19_ft_opp_fga" "x15_19_ft_opp_fg_pct"
#> [15] "x20_24_ft_opp_fgm" "x20_24_ft_opp_fga"
#> [17] "x20_24_ft_opp_fg_pct" "x25_29_ft_opp_fgm"
#> [19] "x25_29_ft_opp_fga" "x25_29_ft_opp_fg_pct"
#> [21] "x30_34_ft_opp_fgm" "x30_34_ft_opp_fga"
#> [23] "x30_34_ft_opp_fg_pct" "x35_39_ft_opp_fgm"
#> [25] "x35_39_ft_opp_fga" "x35_39_ft_opp_fg_pct"
#> [27] "x40_ft_opp_fgm" "x40_ft_opp_fga"
#> [29] "x40_ft_opp_fg_pct"
# make each list in a `rowSet` a named list,
# this allows us to use dplyr::bind_rows() to create a tibble
json$resultSets$rowSet |>
  lapply(setNames, clean_names) |>
  dplyr::bind_rows()
Result:
#> # A tibble: 30 × 29
#>       team_id team_name            less_than_5_ft_opp_fgm less_than_5_ft_opp_fga
#>         <int> <chr>                                 <dbl>                  <dbl>
#>  1 1610612737 Atlanta Hawks                          22                     33.6
#>  2 1610612738 Boston Celtics                         17.4                   28.5
#>  3 1610612751 Brooklyn Nets                          17.9                   28.8
#>  4 1610612766 Charlotte Hornets                      20.5                   31.8
#>  5 1610612741 Chicago Bulls                          18                     28.5
#>  6 1610612739 Cleveland Cavaliers                    17.2                   28.8
#>  7 1610612742 Dallas Mavericks                       20                     29.5
#>  8 1610612743 Denver Nuggets                         18.8                   30.3
#>  9 1610612765 Detroit Pistons                        20.5                   32
#> 10 1610612744 Golden State Warrio…                   17                     25.8
#> # ℹ 20 more rows
#> # ℹ 25 more variables: less_than_5_ft_opp_fg_pct <dbl>, x5_9_ft_opp_fgm <dbl>,
#> #   x5_9_ft_opp_fga <dbl>, x5_9_ft_opp_fg_pct <dbl>, x10_14_ft_opp_fgm <dbl>,
#> #   x10_14_ft_opp_fga <dbl>, x10_14_ft_opp_fg_pct <dbl>,
#> #   x15_19_ft_opp_fgm <dbl>, x15_19_ft_opp_fga <dbl>,
#> #   x15_19_ft_opp_fg_pct <dbl>, x20_24_ft_opp_fgm <dbl>,
#> #   x20_24_ft_opp_fga <dbl>, x20_24_ft_opp_fg_pct <dbl>, …
Created on 2024-01-24 with reprex v2.0.2
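If you want to vary the query (say, a different season), the long hard-coded URL gets awkward to edit. Here is a rough sketch that builds the same request from a named list with `httr2::req_url_query()`. The helper name `nba_opp_shooting_request()` and its `season` argument are made up for illustration, and I'm only assuming the endpoint accepts other `Season` values in the same "YYYY-YY" format; I have not checked that, nor which of the empty parameters are actually required. Nothing here touches the network until you call `req_perform()`.

```r
library(httr2)

# Hypothetical helper (not from the code above): rebuilds the request with the
# same query parameters as req_url, but with the season exposed as an argument.
nba_opp_shooting_request <- function(season = "2023-24") {
  # Parameter names and values copied from the original URL.
  params <- list(
    Conference = "", DateFrom = "", DateTo = "",
    DistanceRange = "5ft Range", Division = "", GameScope = "",
    GameSegment = "", ISTRound = "", LastNGames = "0", Location = "",
    MeasureType = "Opponent", Month = "0", OpponentTeamID = "0",
    Outcome = "", PORound = "0", PaceAdjust = "N", PerMode = "PerGame",
    Period = "0", PlayerExperience = "", PlayerPosition = "",
    PlusMinus = "N", Rank = "N",
    Season = season,  # assumption: other seasons use the same format
    SeasonSegment = "", SeasonType = "Regular Season",
    ShotClockRange = "", StarterBench = "", TeamID = "0",
    VsConference = "", VsDivision = ""
  )

  request("https://stats.nba.com/stats/leaguedashteamshotlocations") |>
    req_url_query(!!!params) |>  # splice the list in as name-value pairs
    req_user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36") |>
    req_headers(
      Accept = "*/*",
      Origin = "https://www.nba.com",
      Referer = "https://www.nba.com/"
    )
}

# Build (but do not send) a request, then inspect the generated URL.
req <- nba_opp_shooting_request(season = "2022-23")
req$url

# To actually fetch the data, continue as in the code above:
# json <- req |> req_perform() |> resp_body_json()
```

The rest of the pipeline (building `clean_names` from the headers and binding the `rowSet`) should work unchanged on the resulting `json`, provided the response keeps the same structure.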