I need to scrape the table data from the following link, using R or Python: https://euroleaguefantasy.euroleaguebasketball.net/en/stats-fantasy-euroleague
So far I have tried the rvest package, but with no luck.
library(rvest)

url <- "https://euroleaguefantasy.euroleaguebasketball.net/en/stats-fantasy-euroleague"
read_html(url)
#> {html_document}
#> <html lang="en">
#> [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
#> [2] <body class="loading">\n<app-root></app-root><button id="ot-sdk-btn" clas ...
Created on 2023-10-06 with reprex v2.0.2
I cannot retrieve (or do not know how to retrieve) any of the table content from here, since neither read_html(url)[1] nor read_html(url)[2] contains it.
How can I continue?
The page you are trying to scrape is a dynamic web page: the table contents are not present in the html you download via read_html(). Instead, that html contains JavaScript code that fetches the data in json format from an API and uses it to populate the table. Your browser runs this JavaScript automatically, which is why you see the table there, but R does not run it when you call read_html().
You can tackle this in one of two ways: either use browser automation such as Selenium, or use your browser's developer tools (Network tab) to find the API request that returns the raw data. I usually find the second approach gives more control over how the data are read and handled, so I will show it here.
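Since the question allows Python too, the second approach works the same way there: find the request url in the browser's developer tools (Network tab, filter by XHR/Fetch), fetch it, and parse the json into rows. A minimal stdlib-only sketch; the payload below is made up to stand in for the live response, just to show the shape of the code:

```python
import json
from urllib.request import urlopen  # for the live request (not run here)

# Made-up stand-in for the json a stats API returns; the real data
# would come from something like: json.load(urlopen(api_url))
payload = json.loads("""
[
  {"id": 1185, "first_name": "Matt",    "last_name": "Thomas", "cr": 8.3},
  {"id": 1187, "first_name": "Shabazz", "last_name": "Napier", "cr": 13.3}
]
""")

# Each record is a dict; together they form the rows of the table.
for row in payload:
    print(row["first_name"], row["last_name"], row["cr"])
```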
First, copy the request url from the Network tab and put it in R. (I have broken the url into several pieces and used paste0() to put them back together, just so it fits on the screen.)
url <- paste0("https://www.dunkest.com/api/stats/table",
"?season_id=15&mode=dunkest&stats_type=avg",
"&weeks[]=1&rounds[]=1&rounds[]=2&teams[]=31",
"&teams[]=32&teams[]=33&teams[]=34&teams[]=35",
"&teams[]=36&teams[]=37&teams[]=38&teams[]=39",
"&teams[]=40&teams[]=41&teams[]=42&teams[]=43",
"&teams[]=44&teams[]=45&teams[]=46&teams[]=47",
"&teams[]=48&positions[]=1&positions[]=2",
"&positions[]=3&player_search=&min_cr=4",
"&max_cr=35&sort_by=pdk&sort_order=desc&iframe=yes")
Now we read the json, turn each player record into a data frame, coerce every column to character (so differing field types across records do not clash when binding), and bind the rows together:
jsonlite::read_json(url) |>
lapply(as.data.frame) |>
lapply(\(x) sapply(x, as.character)) |>
dplyr::bind_rows()
#> # A tibble: 111 x 42
#> id gp first_name last_n~1 cr team_id team_~2 team_~3 posit~4 posit~5
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1185 1 Matt Thomas 8.3 31 BER ALBA B~ 1 G
#> 2 1187 1 Shabazz Napier 13.3 35 CZV Crvena~ 1 G
#> 3 1198 1 Achille Polonara 9.8 47 VIR Virtus~ 2 F
#> 4 1206 1 Timothe Luwawu-~ 10.3 40 ASV LDLC A~ 2 F
#> 5 1213 1 Tornike Shengel~ 12.3 47 VIR Virtus~ 2 F
#> 6 1218 1 Shane Larkin 14.5 32 EFS Anadol~ 1 G
#> 7 1222 1 Milos Teodosic 12.3 35 CZV Crvena~ 1 G
#> 8 1228 1 Jan Vesely 12.8 37 BAR FC Bar~ 3 C
#> 9 1230 1 Balsa Koprivi~ 6.6 44 PAR Partiz~ 3 C
#> 10 1231 1 Lorenzo Brown 14.8 41 MTA Maccab~ 1 G
#> # ... with 101 more rows, 32 more variables: pdk <chr>, plus <chr>, min <chr>,
#> # starter <chr>, pts <chr>, ast <chr>, reb <chr>, stl <chr>, blk <chr>,
#> # blka <chr>, fgm <chr>, fgm_tot <chr>, fga <chr>, fga_tot <chr>, tpm <chr>,
#> # tpm_tot <chr>, tpa <chr>, tpa_tot <chr>, ftm <chr>, ftm_tot <chr>,
#> # fta <chr>, fta_tot <chr>, oreb <chr>, dreb <chr>, tov <chr>, pf <chr>,
#> # fouls_received <chr>, plus_minus <chr>, fgp <chr>, tpp <chr>, ftp <chr>,
#> # slug <chr>, and abbreviated variable names 1: last_name, 2: team_code, ...
Created on 2023-10-06 with reprex v2.0.2
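For completeness, here is the same pipeline sketched in Python with pandas, since the question allows either language. This assumes the endpoint keeps returning a flat list of json records, and coerces every column to string to mirror the R code above; the live request is shown commented out because it needs network access:

```python
import pandas as pd

def records_to_table(records):
    """Bind a list of json records (dicts) into one data frame,
    coercing every column to string to mirror the R pipeline."""
    return pd.DataFrame.from_records(records).astype(str)

# Live usage (needs network; the full url is the one built in the R code):
#
#   import requests
#   records = requests.get(url, timeout=30).json()
#   table = records_to_table(records)
#
# Demonstration on made-up records shaped like the API output:
demo = [
    {"id": 1185, "gp": 1, "first_name": "Matt", "last_name": "Thomas", "cr": 8.3},
    {"id": 1187, "gp": 1, "first_name": "Shabazz", "last_name": "Napier", "cr": 13.3},
]
print(records_to_table(demo))
```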