r, web-scraping, rvest, httr

Web scraping XHR using httr and rvest not working though token has been passed to headers


I am fairly new to web scraping. After searching on SO I found a few examples and based my attempt on them.

My attempt to extract the data looks like this:

library(httr)
library(rvest)
library(dplyr)

s <- session("https://www.barchart.com/stocks/highs-lows/highs")

cookies <- s$response$cookies
token <- URLdecode(dplyr::recode("XSRF-TOKEN", 
                                 !!!setNames(cookies$value, 
                                             cookies$name)))

pg <-GET(url="https://www.barchart.com/proxies/core-api/v1/quotes/get",
         add_headers(
                     Referer="https://www.barchart.com/stocks/highs-lows/highs",
                     `Accept`="application/json",
                     `Accept-Encoding`="gzip, deflate",
                     `Connection`="keep-alive",
                     `User-Agent`="Mozilla/5.0 (X11; Linux x86_64; rv:108.0) Gecko/20100101 Firefox/108.0",
                     `X-XSRF-TOKEN`=token
                    ),
         query=list(
                   lists="stocks.us.new_highs_lows.highs.overall.1y",
                   fields="symbol,symbolName,lastPrice,priceChange,percentChange,volume,highHits1y,highPercent1y,lowPercent1y,tradeTime,symbolCode,symbolType,hasOptions",
                   meta="field.shortName,field.type,field.description,lists.lastUpdate",
                   hasOptions="true",
                   page="1",
                   limit="100",
                   raw="1"
                   ),
         verbose()) -> res

data <- content(res, as = "text")
print(data)

This prints nothing. Ideally I should get some text containing a JSON object that I can parse (that is what I see when I inspect the request in DevTools).

I've spent quite a few hours scratching my head and still don't have a clue. rvest no longer exposes a request_GET function, so the only option left is httr::GET, and that didn't work either.


Solution

  • I'm not a seasoned web scraper by any means, but I took some time to figure this out.

    It seems that the API requires sending cookies in your request or else you will be denied access.

    Note that when you run your code as-is, res$status_code is 401 (Unauthorized), which means the request is not permitted to access the resource.
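
    A quick way to see this before trying to parse anything is to look at the status of the response. The following is only a minimal sketch: `http_status()` and `http_error()` are httr functions, while `describe_status()` is a hypothetical helper written for this illustration, and the code 401 is typed in by hand instead of being read from the real response.

    ```r
    library(httr)

    # Hypothetical helper: turn an HTTP status code into a readable message.
    describe_status <- function(code) {
      status <- http_status(code)  # list with category, reason and message
      status$message
    }

    # With the real response this would be: describe_status(status_code(res))
    describe_status(401L)

    # http_error() is TRUE for any status code of 400 or above
    http_error(401L)
    ```

    With a real response object, `stop_for_status(res)` turns such a status into an R error, so the failure is not silently passed on to `content()`.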

    I had to use DevTools to inspect the web page, look in the Network tab, find the request that calls the API, and copy/paste its cookie string into R. I also added other headers while testing.

    library(httr)
    library(rvest)
    library(dplyr)
    
    
    s <- session("https://www.barchart.com/stocks/highs-lows/highs")
    
    cookies <- s$response$cookies
    token <- URLdecode(dplyr::recode("XSRF-TOKEN", 
                                     !!!setNames(cookies$value, 
                                                 cookies$name)))
    
    # open your browser's dev tools (Inspect) on the page above
    # go to the 'Network' tab and select the request to the quotes API
    # copy its very long cookie string here
    cookie <- "your-very-long-cookie-string-goes-here"
    
    
    
    pg <-GET(url="https://www.barchart.com/proxies/core-api/v1/quotes/get",
             add_headers(
               
               `accept-encoding` = "gzip, deflate, br",
               `accept-language` = "en-US,en;q=0.9",
               `cache-control` = "no-cache",
               `cookie` = cookie,
               `pragma` = "no-cache",
               `referer` = "https://www.barchart.com/stocks/highs-lows/highs",
               `sec-ch-ua` = '"Not?A_Brand";v="8", "Chromium";v="108", "Google Chrome";v="108"',
               `sec-ch-ua-mobile` = "?0",
               `sec-ch-ua-platform` = "macOS",
               `sec-fetch-dest` = "empty",
               `sec-fetch-mode` = "cors",
               `sec-fetch-site` = "same-origin",
               `user-agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
               `x-xsrf-token` = "eyJpdiI6ImpMNGNJZ2Y1ZnQ5bUM5UzVJRW1vaUE9PSIsInZhbHVlIjoibDBVeXVaKzZqUXc2dmVVVTRFVHd3K0NONlcvTzYrZ2ZwM3dMZWJpQldwMzU3SDlOVSt3RDNsa1dHeGdtaTNpRHd0SDhreldYTzEyelV6SXBNT2hOWGNYakR6djNqdDczb2FoWEtmTWE1LzYxVzR3anRQcGRTMmJCRVpJS1FlUWoiLCJtYWMiOiJmYWZjYjdjMGZjODBkZWJiNGI2OGE5MDQ5MGIyZjc2ZmQ4YzM2ZDk1ZTE4ZTUzMjQzYWE1OTQzOWRiNzZkMTE5In0="
                 
    
             ),
             query=list(
               lists="stocks.us.new_highs_lows.highs.overall.1y",
               fields="symbol,symbolName,lastPrice,priceChange,percentChange,volume,highHits1y,highPercent1y,lowPercent1y,tradeTime,symbolCode,symbolType,hasOptions",
               meta="field.shortName,field.type,field.description,lists.lastUpdate",
               hasOptions="true",
               page="1",
               limit="100",
               raw="1"
             ),
             verbose()) -> res
    
    data <- content(res, as = "text")
    #> No encoding supplied: defaulting to UTF-8.
    
    parsed <- content(res, as = "parsed")
    
    
    
    # parsed result 
    
    purrr::map(parsed$data, function(el) {
      purrr::map_df(el$raw, function(data) {
        return(data)
      })
    }) %>% 
      bind_rows()
    
    
    #> # A tibble: 54 × 13
    #>    symbol symbolName     lastP…¹ price…² percen…³ volume highH…⁴ highP…⁵ lowPe…⁶
    #>    <chr>  <chr>            <dbl>   <dbl>    <dbl>  <int>   <int>   <dbl>   <dbl>
    #>  1 ACBA   Ace Global Bu…   11.0   0.36    0.0337  2.6 e3      31 -0.0036  0.0941
    #>  2 ADMA   Adma Biologics    3.86  0.18    0.0489  3.09e6      40 -0.0153  2.06  
    #>  3 ADRA   Adara Acquisi…   10.2   0.0300  0.00296 1.5 e3      65  0       0.0484
    #>  4 AGFS   Agrofresh Sol…    2.96  0.01    0.0034  2.61e5      16 -0.0067  1.03  
    #>  5 AKO.B  Embotell Andn…   14.5  -0.0150 -0.00104 1.46e5       9 -0.0236  0.503 
    #>  6 AKRO   Akero Therape…   53.8   4.24    0.0855  1.05e6      23 -0.0024  6.16  
    #>  7 AMBC   Ambac Financi…   17.1   0.44    0.0263  7.44e5       7 -0.0029  1.37  
    #>  8 ARDX   Ardelyx Inc       2.53  0.02    0.008   3.52e7      14 -0.0524  4.16  
    #>  9 ARYD   Arya Sciences…   10.1  -0.01   -0.0005  2.9 e3      25 -0.001   0.0402
    #> 10 AURC   Aurora Acquis…   10.1   0.05    0.0045  2.07e4      13 -0.0005  0.0302
    #> # … with 44 more rows, 4 more variables: tradeTime <int>, symbolCode <chr>,
    #> #   symbolType <int>, hasOptions <lgl>, and abbreviated variable names
    #> #   ¹​lastPrice, ²​priceChange, ³​percentChange, ⁴​highHits1y, ⁵​highPercent1y,
    #> #   ⁶​lowPercent1y
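
    The nested purrr call above relies on the shape of the parsed response: `parsed$data` is a list of records, and each record carries a `raw` element holding the unformatted fields. A minimal sketch of the same flattening step on a hand-made stand-in follows. The list below is hypothetical and its values are invented; only the `data`/`raw` nesting is taken from the code above.

    ```r
    library(dplyr)
    library(purrr)

    # Hypothetical stand-in for content(res, as = "parsed"); values are invented
    parsed <- list(
      data = list(
        list(symbol = "AAA", raw = list(symbol = "AAA", lastPrice = 10.5, volume = 1200L)),
        list(symbol = "BBB", raw = list(symbol = "BBB", lastPrice = 3.2, volume = 560L))
      )
    )

    # Keep only the `raw` element of each record, then stack the records as rows
    quotes <- map(parsed$data, "raw") %>%
      bind_rows()

    quotes
    ```

    This sketch assumes every field in `raw` is a single non-NULL value. Records with missing fields may need extra handling.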
    

    I tried to build the cookie string from the cookies data frame in exactly the same format, but for some reason that didn't work.
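
    For reference, formatting a cookie data frame as a header string looks roughly like the sketch below. The data frame is a hypothetical stand-in with the same `name` and `value` columns as `s$response$cookies`, and its contents are invented. As noted, a string built this way from the rvest session was not accepted in my tests, so treat it as an illustration of the format only.

    ```r
    # Hypothetical cookie table with the same name/value columns as s$response$cookies
    cookies <- data.frame(
      name  = c("XSRF-TOKEN", "session_id"),
      value = c("abc%3D", "xyz"),
      stringsAsFactors = FALSE
    )

    # A Cookie header is a list of name=value pairs separated by "; "
    cookie_header <- paste0(cookies$name, "=", cookies$value, collapse = "; ")
    cookie_header
    ```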

    The cookie string can be copied from the Network tab in DevTools: select the request that calls the API and copy the value of its cookie request header.
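
    The x-xsrf-token header in the code above is hardcoded from my own browser session, so it will need replacing as well. A minimal sketch of one way to derive it from the copied cookie string is below. It assumes, as the question's code does, that the header value is the URL-decoded XSRF-TOKEN cookie. I have not verified this against the site. `get_cookie_value()` is a hypothetical helper and the cookie string is invented.

    ```r
    # Hypothetical helper: pull one cookie's value out of a "name=value; name=value" string
    get_cookie_value <- function(cookie_string, name) {
      pairs <- strsplit(cookie_string, ";\\s*")[[1]]
      # split each pair on the first "=" only, since values may themselves contain "="
      keys   <- sub("=.*$", "", pairs)
      values <- sub("^[^=]*=", "", pairs)
      values[keys == name][1]
    }

    cookie <- "foo=bar; XSRF-TOKEN=abc123%3D; other=1"  # invented example string
    token  <- utils::URLdecode(get_cookie_value(cookie, "XSRF-TOKEN"))
    token
    ```

    The resulting `token` could then be passed as `x-xsrf-token` = token in add_headers() in place of the literal string.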