Tags: python, r, web-scraping, rvest

Web scraping of a sports table in R


I need to web scrape the table information from the following link, using R or Python: https://euroleaguefantasy.euroleaguebasketball.net/en/stats-fantasy-euroleague

So far I have tried the rvest package, but with no luck.

url <- "https://euroleaguefantasy.euroleaguebasketball.net/en/stats-fantasy-euroleague"

library(rvest)
read_html(url)
#> {html_document}
#> <html lang="en">
#> [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
#> [2] <body class="loading">\n<app-root></app-root><button id="ot-sdk-btn" clas ...

Created on 2023-10-06 with reprex v2.0.2

I cannot retrieve, or do not know how to retrieve, any of the table content from this page, since neither read_html(url)[1] nor read_html(url)[2] contains it.

How can I proceed?


Solution

  • The page you are trying to scrape is a dynamic web page. That means the table contents are not present in the HTML you download via read_html. Instead, the HTML contains JavaScript code that fetches the data in JSON format from an API and uses it to populate the table. Your browser runs this JavaScript automatically, which is why you see the table there, but R does not run it when you use read_html.
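
    In fact, you can check this quickly (a sketch, reusing the page url from the question): if the table were part of the static HTML, rvest would find <table> nodes in it.

    library(rvest)
    # On a dynamic page like this one the static HTML contains no
    # <table> nodes, so this should come back as an empty nodeset
    read_html(url) |> html_elements("table")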

    You can tackle this in one of two ways: either use browser automation such as Selenium, or use your browser's developer tools to find the API request that returns the raw data. I usually find the second approach gives more control over how the data is read and handled, so that is what I will show here. (A rough sketch of the first route follows, for completeness.)
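
    The browser-automation route could look roughly like this with RSelenium. This is an untested sketch that assumes a working browser/driver setup:

    library(RSelenium)
    library(rvest)

    # Start a real browser session that executes the page's JavaScript
    driver <- rsDriver(browser = "firefox", verbose = FALSE)
    remDr <- driver$client
    remDr$navigate("https://euroleaguefantasy.euroleaguebasketball.net/en/stats-fantasy-euroleague")
    Sys.sleep(10)  # crude wait: give the script time to populate the table

    # The rendered page source now contains the table markup
    read_html(remDr$getPageSource()[[1]]) |> html_table()

    remDr$close()
    driver$server$stop()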

    First, get the url from your browser's developer tools (Network tab) and put it in R (I have broken the url into several pieces and used paste0 to put them back together, just so it fits on the screen):

    url <- paste0("https://www.dunkest.com/api/stats/table",
                  "?season_id=15&mode=dunkest&stats_type=avg",
                  "&weeks[]=1&rounds[]=1&rounds[]=2&teams[]=31",
                  "&teams[]=32&teams[]=33&teams[]=34&teams[]=35",
                  "&teams[]=36&teams[]=37&teams[]=38&teams[]=39",
                  "&teams[]=40&teams[]=41&teams[]=42&teams[]=43",
                  "&teams[]=44&teams[]=45&teams[]=46&teams[]=47",
                  "&teams[]=48&positions[]=1&positions[]=2",
                  "&positions[]=3&player_search=&min_cr=4",
                  "&max_cr=35&sort_by=pdk&sort_order=desc&iframe=yes")
    
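    As an aside, if you later want different rounds, teams, or positions, the repetitive parts of that query string are easy to generate rather than type out. A small sketch that builds the same url:

    # Generate the repeated query parameters instead of typing them out
    teams     <- paste0("&teams[]=", 31:48, collapse = "")
    positions <- paste0("&positions[]=", 1:3, collapse = "")
    url <- paste0("https://www.dunkest.com/api/stats/table",
                  "?season_id=15&mode=dunkest&stats_type=avg",
                  "&weeks[]=1&rounds[]=1&rounds[]=2",
                  teams, positions,
                  "&player_search=&min_cr=4&max_cr=35",
                  "&sort_by=pdk&sort_order=desc&iframe=yes")
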

    Now we read the JSON and flatten it into a table:

    # read_json() returns a list with one element per player record.
    # Convert each record to a one-row data frame, coerce every field
    # to character so rows with differing field types bind cleanly,
    # then stack everything into a single tibble.
    jsonlite::read_json(url) |>
      lapply(as.data.frame) |>
      lapply(\(x) sapply(x, as.character)) |>
      dplyr::bind_rows()
    #> # A tibble: 111 x 42
    #>    id    gp    first_name last_n~1 cr    team_id team_~2 team_~3 posit~4 posit~5
    #>    <chr> <chr> <chr>      <chr>    <chr> <chr>   <chr>   <chr>   <chr>   <chr>  
    #>  1 1185  1     Matt       Thomas   8.3   31      BER     ALBA B~ 1       G      
    #>  2 1187  1     Shabazz    Napier   13.3  35      CZV     Crvena~ 1       G      
    #>  3 1198  1     Achille    Polonara 9.8   47      VIR     Virtus~ 2       F      
    #>  4 1206  1     Timothe    Luwawu-~ 10.3  40      ASV     LDLC A~ 2       F      
    #>  5 1213  1     Tornike    Shengel~ 12.3  47      VIR     Virtus~ 2       F      
    #>  6 1218  1     Shane      Larkin   14.5  32      EFS     Anadol~ 1       G      
    #>  7 1222  1     Milos      Teodosic 12.3  35      CZV     Crvena~ 1       G      
    #>  8 1228  1     Jan        Vesely   12.8  37      BAR     FC Bar~ 3       C      
    #>  9 1230  1     Balsa      Koprivi~ 6.6   44      PAR     Partiz~ 3       C      
    #> 10 1231  1     Lorenzo    Brown    14.8  41      MTA     Maccab~ 1       G      
    #> # ... with 101 more rows, 32 more variables: pdk <chr>, plus <chr>, min <chr>,
    #> #   starter <chr>, pts <chr>, ast <chr>, reb <chr>, stl <chr>, blk <chr>,
    #> #   blka <chr>, fgm <chr>, fgm_tot <chr>, fga <chr>, fga_tot <chr>, tpm <chr>,
    #> #   tpm_tot <chr>, tpa <chr>, tpa_tot <chr>, ftm <chr>, ftm_tot <chr>,
    #> #   fta <chr>, fta_tot <chr>, oreb <chr>, dreb <chr>, tov <chr>, pf <chr>,
    #> #   fouls_received <chr>, plus_minus <chr>, fgp <chr>, tpp <chr>, ftp <chr>,
    #> #   slug <chr>, and abbreviated variable names 1: last_name, 2: team_code, ...
    

    Created on 2023-10-06 with reprex v2.0.2
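
    One final note: with this approach every field comes back as character (that is what the sapply(x, as.character) step does). If you want numeric columns again, utils::type.convert() can reconvert them afterwards; a sketch, assuming a reasonably recent R:

    jsonlite::read_json(url) |>
      lapply(as.data.frame) |>
      lapply(\(x) sapply(x, as.character)) |>
      dplyr::bind_rows() |>
      type.convert(as.is = TRUE)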