I am fairly new to web scraping, and after searching on SO I found a few examples to work from.
My attempt to extract the data looks like this:
library(httr)
library(rvest)
library(dplyr)
s <- session("https://www.barchart.com/stocks/highs-lows/highs")
cookies <- s$response$cookies
token <- URLdecode(dplyr::recode("XSRF-TOKEN",
                                 !!!setNames(cookies$value,
                                             cookies$name)))
pg <- GET(url = "https://www.barchart.com/proxies/core-api/v1/quotes/get",
          add_headers(
            Referer = "https://www.barchart.com/stocks/highs-lows/highs",
            `Accept` = "application/json",
            `Accept-Encoding` = "gzip, deflate",
            `Connection` = "keep-alive",
            `User-Agent` = "Mozilla/5.0 (X11; Linux x86_64; rv:108.0) Gecko/20100101 Firefox/108.0",
            `X-XSRF-TOKEN` = token
          ),
          query = list(
            lists = "stocks.us.new_highs_lows.highs.overall.1y",
            fields = "symbol,symbolName,lastPrice,priceChange,percentChange,volume,highHits1y,highPercent1y,lowPercent1y,tradeTime,symbolCode,symbolType,hasOptions",
            meta = "field.shortName,field.type,field.description,lists.lastUpdate",
            hasOptions = "true",
            page = "1",
            limit = "100",
            raw = "1"
          ),
          verbose()) -> res
data <- content(res, as = "text")
print(data)
This prints nothing. I expected to get some text containing a JSON object that I can parse, which is what I see when I inspect the request in DevTools.
I've spent quite a few hours scratching my head and still don't have a clue. rvest no longer exposes a request_GET function, so the only option left is httr::GET, and that didn't really work.
I'm not a seasoned web-scraper by any means, but took some time trying to figure this out.
It seems that the API requires sending cookies in your request or else you will be denied access.
Note that when you run your code as-is, res$status_code is 401 (Unauthorized), which means you aren't being permitted to access the resource.
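A quick way to catch this early is to check the status before reading the body. The helper below is only a minimal sketch: `body_or_stop` is a name I made up (it is not part of httr), and it assumes `res` is the response returned by `httr::GET`.

```r
library(httr)

# Hypothetical helper: return the body as text only if the request succeeded.
body_or_stop <- function(res) {
  # http_error() is TRUE for any status code of 400 or above (e.g. 401)
  if (http_error(res)) {
    stop("Request failed with HTTP status ", status_code(res), call. = FALSE)
  }
  content(res, as = "text", encoding = "UTF-8")
}

# Usage, assuming `res` comes from the GET() call:
# data <- body_or_stop(res)
```

With the original request this should stop with an error that mentions the 401 status instead of silently printing an empty string.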
I had to use DevTools to inspect the web page, look in the Network tab, and find the file that makes the API request, and then copy/paste the cookie string into R, in addition to adding other headers while testing it out.
library(httr)
library(rvest)
library(dplyr)
s <- session("https://www.barchart.com/stocks/highs-lows/highs")
cookies <- s$response$cookies
token <- URLdecode(dplyr::recode("XSRF-TOKEN",
                                 !!!setNames(cookies$value,
                                             cookies$name)))
# go in your browser dev tools/inspect
# go to the 'Network' tab and look for the file in the screenshot below
# copy your very long cookie string here
cookie <- "your-very-long-cookie-string-goes-here"
# send the request with the browser's cookie and headers
res <- GET(url = "https://www.barchart.com/proxies/core-api/v1/quotes/get",
           add_headers(
             `accept-encoding` = "gzip, deflate, br",
             `accept-language` = "en-US,en;q=0.9",
             `cache-control` = "no-cache",
             `cookie` = cookie,
             `pragma` = "no-cache",
             `referer` = "https://www.barchart.com/stocks/highs-lows/highs",
             `sec-ch-ua` = '"Not?A_Brand";v="8", "Chromium";v="108", "Google Chrome";v="108"',
             `sec-ch-ua-mobile` = "?0",
             `sec-ch-ua-platform` = "macOS",
             `sec-fetch-dest` = "empty",
             `sec-fetch-mode` = "cors",
             `sec-fetch-site` = "same-origin",
             `user-agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
             `x-xsrf-token` = "eyJpdiI6ImpMNGNJZ2Y1ZnQ5bUM5UzVJRW1vaUE9PSIsInZhbHVlIjoibDBVeXVaKzZqUXc2dmVVVTRFVHd3K0NONlcvTzYrZ2ZwM3dMZWJpQldwMzU3SDlOVSt3RDNsa1dHeGdtaTNpRHd0SDhreldYTzEyelV6SXBNT2hOWGNYakR6djNqdDczb2FoWEtmTWE1LzYxVzR3anRQcGRTMmJCRVpJS1FlUWoiLCJtYWMiOiJmYWZjYjdjMGZjODBkZWJiNGI2OGE5MDQ5MGIyZjc2ZmQ4YzM2ZDk1ZTE4ZTUzMjQzYWE1OTQzOWRiNzZkMTE5In0="
           ),
           query = list(
             lists = "stocks.us.new_highs_lows.highs.overall.1y",
             fields = "symbol,symbolName,lastPrice,priceChange,percentChange,volume,highHits1y,highPercent1y,lowPercent1y,tradeTime,symbolCode,symbolType,hasOptions",
             meta = "field.shortName,field.type,field.description,lists.lastUpdate",
             hasOptions = "true",
             page = "1",
             limit = "100",
             raw = "1"
           ),
           verbose())
data <- content(res, as = "text")
#> No encoding supplied: defaulting to UTF-8.
parsed <- content(res, as = "parsed")
# parsed result
purrr::map(parsed$data, function(el) {
  purrr::map_df(el$raw, function(data) {
    return(data)
  })
}) %>%
  bind_rows()
#> # A tibble: 54 × 13
#> symbol symbolName lastP…¹ price…² percen…³ volume highH…⁴ highP…⁵ lowPe…⁶
#> <chr> <chr> <dbl> <dbl> <dbl> <int> <int> <dbl> <dbl>
#> 1 ACBA Ace Global Bu… 11.0 0.36 0.0337 2.6 e3 31 -0.0036 0.0941
#> 2 ADMA Adma Biologics 3.86 0.18 0.0489 3.09e6 40 -0.0153 2.06
#> 3 ADRA Adara Acquisi… 10.2 0.0300 0.00296 1.5 e3 65 0 0.0484
#> 4 AGFS Agrofresh Sol… 2.96 0.01 0.0034 2.61e5 16 -0.0067 1.03
#> 5 AKO.B Embotell Andn… 14.5 -0.0150 -0.00104 1.46e5 9 -0.0236 0.503
#> 6 AKRO Akero Therape… 53.8 4.24 0.0855 1.05e6 23 -0.0024 6.16
#> 7 AMBC Ambac Financi… 17.1 0.44 0.0263 7.44e5 7 -0.0029 1.37
#> 8 ARDX Ardelyx Inc 2.53 0.02 0.008 3.52e7 14 -0.0524 4.16
#> 9 ARYD Arya Sciences… 10.1 -0.01 -0.0005 2.9 e3 25 -0.001 0.0402
#> 10 AURC Aurora Acquis… 10.1 0.05 0.0045 2.07e4 13 -0.0005 0.0302
#> # … with 44 more rows, 4 more variables: tradeTime <int>, symbolCode <chr>,
#> # symbolType <int>, hasOptions <lgl>, and abbreviated variable names
#> # ¹lastPrice, ²priceChange, ³percentChange, ⁴highHits1y, ⁵highPercent1y,
#> # ⁶lowPercent1y
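To make the reshaping step easier to follow without hitting the API, here is a small sketch on hand-made data. The `parsed` list below is a stand-in I wrote to mimic the shape the code above relies on (a `data` list whose elements each carry a `raw` sub-list); the symbols and numbers are invented, and the real response has more fields.

```r
library(purrr)
library(dplyr)

# Made-up stand-in for content(res, as = "parsed"); values are illustrative only.
parsed <- list(
  data = list(
    list(symbol = "AAA", lastPrice = "10.00",
         raw = list(symbol = "AAA", lastPrice = 10.0, volume = 1500L)),
    list(symbol = "BBB", lastPrice = "2.50",
         raw = list(symbol = "BBB", lastPrice = 2.5, volume = 260000L))
  )
)

# Turn each element's `raw` list into a one-row tibble, then stack the rows.
quotes <- map(parsed$data, function(el) as_tibble(el$raw)) %>%
  bind_rows()

print(quotes)
```

Each `raw` entry becomes one row, so the result has one row per symbol and one column per requested field.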
I tried to use the cookies data frame to format my cookie in the exact same way, but for some reason that didn't work.
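For reference, this is roughly what that formatting looks like. It is only a sketch on a made-up data frame with the same `name` and `value` columns as `s$response$cookies`; the cookie names and values are placeholders, and, as mentioned, a header built this way was not accepted when I tried it against the real site.

```r
# Made-up cookie table; only the name/value columns matter here.
cookies <- data.frame(
  name  = c("XSRF-TOKEN", "session_id"),
  value = c("abc%3D", "xyz123"),
  stringsAsFactors = FALSE
)

# Join the pairs the way a Cookie header is written: name=value; name=value
cookie_header <- paste0(cookies$name, "=", cookies$value, collapse = "; ")

# The code above sends the URL-decoded XSRF-TOKEN value as the x-xsrf-token header.
token <- URLdecode(cookies$value[cookies$name == "XSRF-TOKEN"])

print(cookie_header)
print(token)
```

One possible reason the copied browser string works while this one does not is that the browser's string contains cookies that were set after the first page load, but that is a guess on my part.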
Here's the file where you can copy the cookie string: