r, web-scraping, post, httr

R - httr POST request to website investing.com to get a JSON response


I am trying to scrape information from investing.com based on a stock's ISIN code.

When I enter the ISIN code in the search form at the top of the site, the browser sends an XHR POST request. Here is the JSON content I get back:

{"total":{"articles":10,"allResults":16,"quotes":6},"score":{"articles":25.00122},"articles":[...],
 "quotes":[
{"pairId":386,"name":"Accor SA","flag":"France","link":"\/equities\/accor","symbol":"ACCP","type":"Action - Paris","pair_type_raw":"Equities","pair_type":"equities","countryID":22,"sector":2,"region":6,"industry":55,"isCrypto":false,"exchange":"Paris","exchangeID":9},
{"pairId":948559,"name":"Accor SA","flag":"UK","link":"\/equities\/accor?cid=948559","symbol":"0H59","type":"Action - Londres","pair_type_raw":"Equities","pair_type":"equities","countryID":4,"sector":16,"region":6,"industry":129,"isCrypto":false,"exchange":"Londres","exchangeID":3},
{"pairId":33386,"name":"Accor SA","flag":"France","link":"\/equities\/accor?cid=33386","symbol":"ACp","type":"Action - BATS Europe","pair_type_raw":"Equities","pair_type":"equities","countryID":22,"sector":16,"region":6,"industry":129,"isCrypto":false,"exchange":"BATS Europe","exchangeID":121},
{"pairId":963294,"name":"Accor SA","flag":"Germany","link":"\/equities\/accor?cid=963294","symbol":"ACCP","type":"Action - Francfort","pair_type_raw":"Equities","pair_type":"equities","countryID":17,"sector":16,"region":6,"industry":129,"isCrypto":false,"exchange":"Francfort","exchangeID":104},
{"pairId":963914,"name":"Accor SA","flag":"Germany","link":"\/equities\/accor?cid=963914","symbol":"ACCP","type":"Action - TradeGate","pair_type_raw":"Equities","pair_type":"equities","countryID":17,"sector":0,"region":6,"industry":0,"isCrypto":false,"exchange":"TradeGate","exchangeID":105},
{"pairId":993697,"name":"Accor SA","flag":"Mexico","link":"\/equities\/accor?cid=993697","symbol":"ACCN","type":"Action - Mexico","pair_type_raw":"Equities","pair_type":"equities","countryID":7,"sector":16,"region":2,"industry":129,"isCrypto":false,"exchange":"Mexico","exchangeID":53}]}

Using the browser's inspection tools, I derived a POST request that should retrieve just the JSON I need rather than the whole page:

library(httr)
codeIsin <- 'FR0000120404'
    
investing_url <- list(scheme="https",
                      host="fr.investing.com",
                      filename="/search/service/searchTopBar")
      
investing_url <- modify_url(url="",
                            scheme=investing_url$scheme,
                            hostname=investing_url$host,
                            path=investing_url$filename)   
      
investing_query <- paste0("search_text=",codeIsin)
      
investing_headers <- list("Host" = "fr.investing.com",
                          "User-Agent" = "Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0",
                          "Accept" = "application/json, text/javascript, */*; q=0.01",
                          "Accept-Language" = "fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3",
                          "Accept-Encoding" = "gzip, deflate, br",
                          "Content-Type" = "application/x-www-form-urlencoded",
                          "X-Requested-With" = "XMLHttpRequest",
                          "Content-Length" = "23",
                          "Origin" = "https://fr.investing.com",
                          "Connection" = "keep-alive",
                          "Pragma" = "no-cache",
                          "Cache-Control" = "no-cache",
                          "TE" = "Trailers"
                          )

      
response <- POST(url = investing_url,
                 query = investing_query,
                 header = investing_headers)

I get back raw content:

typeof(response$content)
[1] "raw"

response$content
   [1] 20 3c 21 44 4f 43 54 59 50 45 20 48 54 4d 4c 3e 0a 3c 68 74 6d 6c 20 64 69 72 3d 22 6c 74 72 22 20
  [34] 78 6d 6c 6e 73 3d 22 68 74 74 70 3a 2f 2f 77 77 77 2e 77 33 2e 6f 72 67 2f 31 39 39 39 2f 78 68 74
...
 [958] 65 35 2a 64 29 3b 65 2b 3d 27 3b 65 78 70 69 72 65 73 3d 22 27 3b 65 2b 3d 6e 2e 74 6f 47 4d 54 53
 [991] 74 72 69 6e 67 28 29 3b 65 2b
 [ reached getOption("max.print") -- omitted 688441 entries ]

Once decoded with content(response, "text"), it appears to be the main page of the website.
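
As a quick offline check of that observation, the first bytes of the dump above can be decoded by hand with base R's rawToChar(). This is only a sketch on the 16 bytes printed above, not on the full response; the variable names are illustrative:

```r
# First 16 bytes of response$content, copied from the dump above
first_bytes <- as.raw(c(0x20, 0x3c, 0x21, 0x44, 0x4f, 0x43, 0x54, 0x59,
                        0x50, 0x45, 0x20, 0x48, 0x54, 0x4d, 0x4c, 0x3e))

# rawToChar() converts a raw vector into a character string
decoded <- rawToChar(first_bytes)
print(decoded)
# [1] " <!DOCTYPE HTML>"
```

The leading doctype confirms that the server answered with an HTML page rather than JSON.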

response$request shows that my headers are not sent, in particular "Content-Type" = "application/x-www-form-urlencoded":

> response$request
<request>
POST https://fr.investing.com/search/service/searchTopBar?search_text=FR0000120404
Output: write_memory
Options:
* useragent: libcurl/7.74.0 r-curl/4.3 httr/1.4.2
* post: TRUE
* postfieldsize: 0
Headers:
* Accept: application/json, text/xml, application/xml, */*
* Content-Type:

What is wrong with my request?


Solution

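  • In your call, query appends search_text to the URL and leaves the POST body empty (postfieldsize: 0 in your dump), and POST() has no header argument, so your list is never applied as headers; httr sets headers via add_headers(). If you are not too tied to your original syntax, you can switch to the following. Note that I have added a cookie header so that the onward redirect works within httr: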

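    library(httr)
    library(jsonlite)
    library(rvest)  # provides html_element() and html_text()
    
    headers <- c(
      'user-agent' = 'Safari/537.36',
      'x-requested-with' = 'XMLHttpRequest',
      'cookie' = 'adBlockerNewUserDomains=on')
    
    # Form fields sent in the POST body (not in the URL)
    data <- list(
      'search_text' = 'FR0000120404'
    )
    
    # content() yields an HTML document here; the JSON text sits in its <p> element.
    # The native pipe |> requires R >= 4.1.
    r <- httr::POST(url = 'https://fr.investing.com/search/service/searchTopBar',
                    httr::add_headers(.headers = headers),
                    body = data, encode = 'form') |>
      content() |>
      html_element('p') |>
      html_text() |>
      jsonlite::parse_json()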
    
    r
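
With its defaults, parse_json() returns r as a nested list. If a data frame of quotes is more convenient, one option is to let jsonlite simplify the array, for example with fromJSON() or parse_json(simplifyVector = TRUE). The sketch below runs offline on a trimmed copy of the JSON from the question (two quotes, four fields); sample_json, quotes and paris are illustrative names, and it assumes the live response keeps the same shape:

```r
library(jsonlite)

# Trimmed sample of the response shown in the question (assumed shape)
sample_json <- '{"quotes":[
  {"pairId":386,"name":"Accor SA","symbol":"ACCP","exchange":"Paris"},
  {"pairId":948559,"name":"Accor SA","symbol":"0H59","exchange":"Londres"}
]}'

# fromJSON() simplifies an array of same-shaped objects into a data frame
quotes <- fromJSON(sample_json)$quotes

# Keep only the listing on a given exchange
paris <- quotes[quotes$exchange == "Paris", ]
print(paris$pairId)
# [1] 386
```

Filtering on exchange (or on symbol) is then a plain data frame operation, which is handy because a single ISIN maps to several listings.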