I am trying to scrape information from investing.com based on the ISIN code of a stock.
When I enter the ISIN code in the search form at the top of the site, an XHR POST request is sent. Here is the JSON content I get:
{"total":{"articles":10,"allResults":16,"quotes":6},"score":{"articles":25.00122},"articles":[...],
"quotes":[
{"pairId":386,"name":"Accor SA","flag":"France","link":"\/equities\/accor","symbol":"ACCP","type":"Action - Paris","pair_type_raw":"Equities","pair_type":"equities","countryID":22,"sector":2,"region":6,"industry":55,"isCrypto":false,"exchange":"Paris","exchangeID":9},
{"pairId":948559,"name":"Accor SA","flag":"UK","link":"\/equities\/accor?cid=948559","symbol":"0H59","type":"Action - Londres","pair_type_raw":"Equities","pair_type":"equities","countryID":4,"sector":16,"region":6,"industry":129,"isCrypto":false,"exchange":"Londres","exchangeID":3},
{"pairId":33386,"name":"Accor SA","flag":"France","link":"\/equities\/accor?cid=33386","symbol":"ACp","type":"Action - BATS Europe","pair_type_raw":"Equities","pair_type":"equities","countryID":22,"sector":16,"region":6,"industry":129,"isCrypto":false,"exchange":"BATS Europe","exchangeID":121},
{"pairId":963294,"name":"Accor SA","flag":"Germany","link":"\/equities\/accor?cid=963294","symbol":"ACCP","type":"Action - Francfort","pair_type_raw":"Equities","pair_type":"equities","countryID":17,"sector":16,"region":6,"industry":129,"isCrypto":false,"exchange":"Francfort","exchangeID":104},
{"pairId":963914,"name":"Accor SA","flag":"Germany","link":"\/equities\/accor?cid=963914","symbol":"ACCP","type":"Action - TradeGate","pair_type_raw":"Equities","pair_type":"equities","countryID":17,"sector":0,"region":6,"industry":0,"isCrypto":false,"exchange":"TradeGate","exchangeID":105},
{"pairId":993697,"name":"Accor SA","flag":"Mexico","link":"\/equities\/accor?cid=993697","symbol":"ACCN","type":"Action - Mexico","pair_type_raw":"Equities","pair_type":"equities","countryID":7,"sector":16,"region":2,"industry":129,"isCrypto":false,"exchange":"Mexico","exchangeID":53}]}
I derived a POST request from the browser's inspection tools to retrieve only the JSON I need, not the whole page:
library(httr)
codeIsin <- 'FR0000120404'
investing_url <- list(scheme="https",
host="fr.investing.com",
filename="/search/service/searchTopBar")
investing_url <- modify_url(url="",
scheme=investing_url$scheme,
hostname=investing_url$host,
path=investing_url$filename)
investing_query <- paste0("search_text=",codeIsin)
investing_headers <- list("Host" = "fr.investing.com",
"User-Agent" = "Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0",
"Accept" = "application/json, text/javascript, */*; q=0.01",
"Accept-Language" = "fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3",
"Accept-Encoding" = "gzip, deflate, br",
"Content-Type" = "application/x-www-form-urlencoded",
"X-Requested-With" = "XMLHttpRequest",
"Content-Length" = "23",
"Origin" = "https://fr.investing.com",
"Connection" = "keep-alive",
"Pragma" = "no-cache",
"Cache-Control" = "no-cache",
"TE" = "Trailers"
)
response <- POST(url = investing_url,
query = investing_query,
header = investing_headers)
I get back raw content:
typeof(response$content)
[1] "raw"
response$content
[1] 20 3c 21 44 4f 43 54 59 50 45 20 48 54 4d 4c 3e 0a 3c 68 74 6d 6c 20 64 69 72 3d 22 6c 74 72 22 20
[34] 78 6d 6c 6e 73 3d 22 68 74 74 70 3a 2f 2f 77 77 77 2e 77 33 2e 6f 72 67 2f 31 39 39 39 2f 78 68 74
...
[958] 65 35 2a 64 29 3b 65 2b 3d 27 3b 65 78 70 69 72 65 73 3d 22 27 3b 65 2b 3d 6e 2e 74 6f 47 4d 54 53
[991] 74 72 69 6e 67 28 29 3b 65 2b
[ reached getOption("max.print") -- omitted 688441 entries ]
Once decoded with content(response, "text"), it appears to be the main page of the website.
response$request shows that not all headers are sent, in particular "Content-Type" = "application/x-www-form-urlencoded":
> response$request
<request>
POST https://fr.investing.com/search/service/searchTopBar?search_text=FR0000120404
Output: write_memory
Options:
* useragent: libcurl/7.74.0 r-curl/4.3 httr/1.4.2
* post: TRUE
* postfieldsize: 0
Headers:
* Accept: application/json, text/xml, application/xml, */*
* Content-Type:
What is wrong with my request?
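In your call, query = appends the ISIN to the URL instead of sending it as a form body (hence postfieldsize: 0), and header = is not an argument that httr picks up, so your headers are never sent; headers have to be passed through add_headers().
If you are not too tied to the syntax used, you can switch as follows. Note that I have added a cookie header to allow for the onward redirect within httr: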
library(httr)
library(jsonlite)
library(rvest)  # provides html_element() and html_text() used below
headers = c(
'user-agent' = 'Safari/537.36',
'x-requested-with' = 'XMLHttpRequest',
'cookie' = 'adBlockerNewUserDomains=on')
data = list(
'search_text' = 'FR0000120404'
)
r <- httr::POST(url = 'https://fr.investing.com/search/service/searchTopBar', httr::add_headers(.headers=headers),
body =data, encode = 'form') |>
content() |>
html_element('p') |>
html_text() |>
jsonlite::parse_json()
r
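If you would rather stay close to your original code, the same fix can be sketched as a small function. This is only a sketch under a few assumptions: search_isin is a hypothetical helper name, the cookie is the one used above and may or may not still be required, and the response body is assumed to be JSON text that jsonlite::fromJSON() can read directly.

```r
library(httr)
library(jsonlite)

# Hypothetical helper (not from the original post): look up an ISIN via the
# search endpoint used in the question and return the "quotes" entries.
search_isin <- function(isin) {
  resp <- POST(
    url = "https://fr.investing.com/search/service/searchTopBar",
    # Headers are passed through add_headers(), not a `header =` argument.
    add_headers(
      "User-Agent" = "Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0",
      "X-Requested-With" = "XMLHttpRequest",
      # Cookie taken from the answer above; whether it is still needed is an assumption.
      "Cookie" = "adBlockerNewUserDomains=on"
    ),
    # A named list with encode = "form" is sent as an
    # application/x-www-form-urlencoded body, so Content-Type and
    # Content-Length do not have to be set by hand.
    body = list(search_text = isin),
    encode = "form"
  )
  stop_for_status(resp)
  # Assumption: the body is JSON text, whatever Content-Type the server declares.
  txt <- content(resp, as = "text", encoding = "UTF-8")
  fromJSON(txt)$quotes
}

# Usage (performs a live request, so it is left commented out):
# quotes <- search_isin("FR0000120404")
# quotes[, c("symbol", "exchange", "link")]
```

Leaving Content-Length to curl avoids a mismatch between a hard-coded value and the actual body size. If the server behaves as in the question, fromJSON() should simplify the "quotes" array into a data frame with one row per listing.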