
R Screen Scrape with dropdown value


The page https://www.covers.com/sport/basketball/nba/matchup/290850/props is dynamic in two ways:

  1. The URL depends on the game selected (the #######/props part differs from game to game).
  2. The page has a dropdown selector called "props"; changing it updates the page to display the information associated with the selected value.

A) I am struggling to use rvest to load each "row" of information into a table with these columns: 1) Name 2) Prop value 3) Prediction value 4) Best odds 5) Analysis

B) In R, how would you change the "props" selector so that the same data as in A can be downloaded for each value?

I have been beating my head against my monitor: I can get down to the node level, but I am struggling with how to parse out the information at the lowest level.

library(rvest)

covers_page <- "https://www.covers.com/sport/basketball/nba/matchup/290850/props"

tmp <- read_html(covers_page)

# select the props container directly by XPath; a preceding html_elements("div") is not needed
nodes_1 <- tmp %>% html_elements(xpath = "//div[contains(@class,'player-props-table-container')]")


Solution

  • B: if you check the HTTP requests in the network tab of the browser's dev tools, you will notice that the "props" drop-down triggers Ajax calls (e.g. ... /290850/market?propEvent=NBA_GAME_PLAYER_POINTS) that fetch the table content for each prop. As rvest can't run JavaScript, the URLs have to be built from the drop-down item values, so first we need a list of those:

    library(rvest)
    library(dplyr)
    library(purrr)
    library(stringr)
    
    url_ <- "https://www.covers.com/sport/basketball/nba/matchup/290850/props"
    prop_events <- 
      read_html(url_) |>
      html_elements("li[data-event-name]") |>
      map(\(elem) list(event = html_attr(elem, "data-event-name"),
                       descr = html_text(elem))) |>
      bind_rows()
    prop_events
    #> # A tibble: 12 × 2
    #>    event                                   descr                              
    #>    <chr>                                   <chr>                              
    #>  1 NBA_GAME_PLAYER_POINTS                  Points Scored                      
    #>  2 NBA_GAME_PLAYER_POINTS_REBOUNDS         Points and Rebounds                
    #>  3 NBA_GAME_PLAYER_POINTS_ASSISTS          Points and Assists                 
    #>  4 NBA_GAME_PLAYER_3_POINTERS_MADE         3-Pointers Made                    
    #>  5 NBA_GAME_PLAYER_REBOUNDS_ASSISTS        Rebounds and Assists               
    #>  6 NBA_GAME_PLAYER_STEALS_BLOCKS           Steals and Blocks                  
    #>  7 NBA_GAME_PLAYER_BLOCKS                  Total Blocks                       
    #>  8 NBA_GAME_PLAYER_STEALS                  Total Steals                       
    #>  9 NBA_GAME_PLAYER_REBOUNDS                Total Rebounds                     
    #> 10 NBA_GAME_PLAYER_POINTS_REBOUNDS_ASSISTS Total Points, Rebounds, and Assists
    #> 11 NBA_GAME_PLAYER_TURNOVERS               Total Turnovers                    
    #> 12 NBA_GAME_PLAYER_ASSISTS                 Total Assists
    
    # url for props Ajax calls
    (url_market <- str_replace(url_, "props$", "market?propEvent="))
    #> [1] "https://www.covers.com/sport/basketball/nba/matchup/290850/market?propEvent="
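
Since the matchup id is the other moving part of the URL, the same string manipulation can be wrapped in a small helper. This is only a sketch: `market_url()` is a hypothetical name, and the URL pattern is assumed from the single Ajax request seen in the network tab, so it may not hold for other sports or future site changes.

```r
library(stringr)

# Hypothetical helper: builds the Ajax URL(s) for one matchup and one or more prop events.
# The URL pattern is an assumption based on the request observed for matchup 290850.
market_url <- function(matchup_id, event) {
  str_c(
    "https://www.covers.com/sport/basketball/nba/matchup/",
    matchup_id,
    "/market?propEvent=",
    event
  )
}

# The matchup id is passed as a string to avoid any numeric formatting surprises
urls <- market_url("290850", c("NBA_GAME_PLAYER_POINTS", "NBA_GAME_PLAYER_ASSISTS"))
urls
```

Because `str_c()` is vectorised, passing the whole `event` column returns one URL per prop.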
    

  • A: you'd generally want to be more specific with your CSS selectors than a plain div. Elements returned from html_element() / html_elements() can be passed to further html_element() / html_elements() calls, meaning you can first select all articles (article.player-prop-article), then iterate through that list of elements and extract the bits of interest from each individual article.

    # fetch content and process rows (player-prop-article), return tibble
    parse_prop <- function(event_url){
      read_html(event_url) |>
      html_elements("article.player-prop-article") |>
      map(\(art) list(
        name = html_element(art, ".player-headshot-name strong") |> html_text(),
        team = html_element(art, ".player-headshot-name > div") |> html_text() |> str_split_i("\r\n", 3) |> str_squish(),
        prop = html_element(art, ".player-props-projection-bestOdds-div > div:nth-child(1) strong") |> html_text(),
        proj = html_element(art, ".player-props-projection-bestOdds-div > div:nth-child(2) strong") |> html_text(),
        odds = html_element(art, ".player-bestOdds-row > a > div > span") |> html_text(),
        art  = html_element(art, ".player-analysis") |> html_text())) |>
      bind_rows()
    }
    
    # call parse_prop() on the first three propEvents and row-bind the results
    props <- 
      prop_events$event[1:3] |>
      set_names() |>
      map(\(event) str_c(url_market, event)) |>
      map(parse_prop, .progress = TRUE) |>
      list_rbind(names_to = "prop_event")
    props
    #> # A tibble: 39 × 7
    #>    prop_event             name               team        prop  proj  odds  art  
    #>    <chr>                  <chr>              <chr>       <chr> <chr> <chr> <chr>
    #>  1 NBA_GAME_PLAYER_POINTS Ja Morant          PG • Memph… 25.5  22.4  -120  Offe…
    #>  2 NBA_GAME_PLAYER_POINTS Jaren Jackson Jr.  PF • Memph… 18.5  21.5  -125  Jare…
    #>  3 NBA_GAME_PLAYER_POINTS Jonas Valanciunas  C • New Or… 15.5  14.2  -114  Out …
    #>  4 NBA_GAME_PLAYER_POINTS Vince Williams Jr. SG • Memph… 6.5   8.2   -150  Vinc…
    #>  5 NBA_GAME_PLAYER_POINTS CJ McCollum        SG • New O… 17.5  19.4  -125  CJ M…
    #>  6 NBA_GAME_PLAYER_POINTS Santi Aldama       PF • Memph… 8.5   9.5   -110  The …
    #>  7 NBA_GAME_PLAYER_POINTS Herbert Jones      PF • New O… 9.5   10.2  -110  Herb…
    #>  8 NBA_GAME_PLAYER_POINTS Trey Murphy III    SF • New O… 12.5  13.5  -130  Amon…
    #>  9 NBA_GAME_PLAYER_POINTS Bismack Biyombo    C • Memphis 6.5   6     -140  Bism…
    #> 10 NBA_GAME_PLAYER_POINTS David Roddy        SF • Memph… 7.5   7.9   -106  Davi…
    #> # ℹ 29 more rows
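
The per-article iteration used in parse_prop() can be tried offline on a tiny document. The sketch below uses made-up markup (the class names `.name` and `.odds` are invented for illustration and are not taken from covers.com) to show that a selector passed to `html_element()` is scoped to the article it is called on, and that a missing piece becomes `NA` rather than shifting the rows.

```r
library(rvest)
library(purrr)
library(dplyr)

# Made-up markup, loosely modelled on the article structure; not copied from covers.com
page <- minimal_html('
  <article class="player-prop-article">
    <div class="name"><strong>Player A</strong></div>
    <div class="odds"><span>-120</span></div>
  </article>
  <article class="player-prop-article">
    <div class="name"><strong>Player B</strong></div>
  </article>
')

rows <-
  page |>
  html_elements("article.player-prop-article") |>
  # each `art` is a single article; selectors only match inside it
  map(\(art) list(
    name = html_element(art, ".name strong") |> html_text(),
    odds = html_element(art, ".odds span") |> html_text())) |>
  bind_rows()
rows
```

The second article has no odds element, so its `odds` value is `NA` while its `name` stays on the correct row.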
    
    

    A perhaps more common approach is to extract column vectors from the document / parent element and combine them into a data.frame / tibble, something like this:

    html <- read_html("https://www.covers.com/sport/basketball/nba/matchup/290850/market?propEvent=NBA_GAME_PLAYER_POINTS")
    tibble(
      name = html_elements(html, ".player-headshot-name strong") |> html_text(),
      prop = html_elements(html, ".player-props-projection-bestOdds-div > div:nth-child(1) strong") |> html_text(),
      proj = html_elements(html, ".player-props-projection-bestOdds-div > div:nth-child(2) strong") |> html_text()
    )
    

    While this also tends to be faster than iterating over elements, it's somewhat less robust: it only works when there is no chance that the input vectors end up with different lengths.
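
The length problem, and one way around it, can be illustrated offline. The sketch below again uses made-up markup (class names `.name` and `.odds` are invented). It relies on the documented rvest behaviour that `html_element()` (singular) applied to a set of elements returns exactly one result per input element, with a missing value where nothing matches; it is an alternative to the code above, not part of the original answer.

```r
library(rvest)
library(dplyr)

# Made-up markup: the second article has no odds element
page <- minimal_html('
  <article class="player-prop-article">
    <div class="name"><strong>Player A</strong></div>
    <div class="odds"><span>-120</span></div>
  </article>
  <article class="player-prop-article">
    <div class="name"><strong>Player B</strong></div>
  </article>
  <article class="player-prop-article">
    <div class="name"><strong>Player C</strong></div>
    <div class="odds"><span>-105</span></div>
  </article>
')

# Document-level selection: the two vectors differ in length (3 vs 2)
names_all <- html_elements(page, ".name strong") |> html_text()
odds_all  <- html_elements(page, ".odds span") |> html_text()

# Parent-first selection: one value per article, NA where the child is missing
articles <- html_elements(page, "article.player-prop-article")
cols <- tibble(
  name = articles |> html_element(".name strong") |> html_text(),
  odds = articles |> html_element(".odds span") |> html_text()
)
cols
```

This keeps the column-wise style while aligning every column to the same set of parent elements.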