Scraping from website using rvest but the data in the table isn't loading


I'm trying to scrape what appears to be a JavaScript-rendered table, but the code below returns a data frame that has the headers and nothing in the body:

library(rvest)
library(tidyverse)

fa_link <- "https://overthecap.com/free-agency"

fa_table <- fa_link %>% 
  read_html() %>% 
  html_element("table") %>% 
  html_table()

I need to scrape this table on a remote server, so I don't think RSelenium (or a comparable solution) is an option. Is there a way to get the table's data with rvest?


Solution

  • The actual table content is fetched by an Ajax call. Once you have identified it in the network tab of your browser's dev tools (search for a keyword from the table, e.g. "Tom Brady"), you can mimic it with httr or httr2, for example. The response contains only the table rows; once the missing HTML table pieces are added around them, the result can be parsed with rvest::html_table():

    library(httr2)
    library(rvest)
    library(stringr)
    
    # make a POST request with action=get_free_agents&season=2023 in form data
    resp <- request("https://overthecap.com/wp-admin/admin-ajax.php") %>% 
      req_body_form(
        action = "get_free_agents",
        season = 2023) %>%
      req_perform()
    
    # check response content
    resp %>% resp_body_string() %>% str_trunc(80)
    #> [1] "\t\t\t\t<tr class=\"sortable\" data-old-team=\"TB\" data-new-team=\"\" data-position=\"Q..."
    
    # response includes table rows
    # fit those into table template from https://overthecap.com/free-agency source,
    # "{.}" in "<tbody>{.}</tbody>" corresponds to resp_body_string() output
    resp %>% resp_body_string() %>% 
      str_glue(
      '<table class="controls-table" id="table2023" cellspacing="0" align="center">
        <thead>
            <tr>
                <th class="sortable">Player</th>
                <th class="sortable sorttable_numeric">Pos.</th>
                <th class="sortable">2022 Team</th>
                <th class="sortable">2023 Team</th>
                <th class="sortable">Type</th>
                <th class="sortable">Snaps</th>
                <th class="sortable">Age</th>
                <th class="sortable">Current APY</th>
                <th class="sortable mobile_drop">Guarantees</th>
            </tr>
        </thead>
        <tbody>{.}</tbody>') %>% 
      # turn it into valid html
      minimal_html() %>% 
      html_element("table") %>% 
      html_table()
    

    Result:

    #> # A tibble: 838 × 9
    #>    Player          Pos.  `2022 Team` 2023 Te…¹ Type  Snaps   Age Curre…² Guara…³
    #>    <chr>           <chr> <chr>       <chr>     <chr> <chr> <int> <chr>   <chr>  
    #>  1 Tom Brady       QB    Buccaneers  ""        Void  98.0%    46 $25,00… $25,00…
    #>  2 Michael Thomas  WR    Saints      ""        Void  12.9%    30 $19,25… $35,64…
    #>  3 Orlando Brown   LT    Chiefs      ""        UFA   98.4%    27 $16,66… $16,66…
    #>  4 Baker Mayfield  QB    Rams        ""        UFA   63.7%    28 $15,35… $15,35…
    #>  5 Deion Jones     LB    Browns      ""        UFA   38.8%    29 $14,25… $18,80…
    #>  6 Marcus Peters   CB    Ravens      ""        UFA   73.2%    30 $14,00… $21,00…
    #>  7 Fletcher Cox    IDL   Eagles      ""        Void  64.5%    33 $14,00… $14,00…
    #>  8 Robert Quinn    EDGE  Eagles      ""        Void  35.9%    33 $14,00… $30,00…
    #>  9 Javon Hargrave  IDL   Eagles      ""        Void  64.4%    30 $13,00… $26,00…
    #> 10 Yannick Ngakoue EDGE  Colts       ""        Void  64.3%    28 $13,00… $21,00…
    #> # … with 828 more rows, and abbreviated variable names ¹​`2023 Team`,
    #> #   ²​`Current APY`, ³​Guarantees
    

    Created on 2023-02-11 with reprex v2.0.2
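
The wrapping step can also be tried offline, without calling the site. The following is a minimal sketch under stated assumptions: the two rows in `rows` are made-up placeholders standing in for the Ajax response body (they are not real data from overthecap.com), and the three-column header is a shortened, hypothetical version of the real one. It also shows one way the currency strings might be converted to numbers afterwards, which the original answer does not cover.

```r
library(rvest)

# Hypothetical row fragment standing in for the Ajax response body
rows <- paste0(
  "<tr><td>Player A</td><td>QB</td><td>$1,000,000</td></tr>",
  "<tr><td>Player B</td><td>WR</td><td>$2,500,000</td></tr>"
)

# Wrap the bare rows in a complete table so html_table() has a <table> to parse
table_html <- paste0(
  "<table><thead><tr>",
  "<th>Player</th><th>Pos.</th><th>Current APY</th>",
  "</tr></thead>",
  "<tbody>", rows, "</tbody></table>"
)

fa_demo <- minimal_html(table_html) %>%
  html_element("table") %>%
  html_table()

# Currency values arrive as character; strip "$" and "," to get numbers
apy <- as.numeric(gsub("[$,]", "", fa_demo[["Current APY"]]))
```

Building the template with `paste0()` instead of string interpolation is a design choice for this sketch: the response text is inserted as-is and is never treated as a template itself.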