'RCurl' [R] package getURL webpage error when scraping API


I am trying to scrape data from an API using the getURL function of the RCurl package in R. My problem is that I can't reproduce in R the response I get when I open the URL in Chrome. When I open the API page (URL below) in Chrome it works fine, but when I request it with getURL in R (or open it in Chrome's incognito mode) I get a '500 Internal Server Error' response instead of the JSON I'm looking for.

URL/API in question: http://www.bluenile.com/api/public/loose-diamond/diamond-details/panel?country=USA&currency=USD&language=en-us&productSet=BN&sku=LD04077082

Here is my (failed) request in [R].

    test2 <- fromJSON(getURL(
      "http://www.bluenile.com/api/public/loose-diamond/diamond-details/panel?country=USA&currency=USD&language=en-us&productSet=BN&sku=LD04077082",
      ssl.verifypeer = FALSE,
      useragent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36"
    ))

My Research so Far

First, I looked at this prior question on Stack Overflow and added my user agent to the request (this did not solve the problem but may still be necessary): ViralHeat API issues with getURL() command in RCurl package

Next, I looked at this helpful post, which informed my reasoning: R Disparity between browser and GET / getURL

My Ideas About the Solution

This is not my area of expertise, but my guess is that the request is missing a cookie that the server needs (which would explain why it also fails in my browser's incognito mode). I compared the requests and responses of the successful and unsuccessful requests:

Successful request: [screenshot not preserved]

Unsuccessful request: [screenshot not preserved]

Does anyone have any ideas? Should I try the RSelenium package, which MrFlick suggested in the second post mentioned above?


Solution

  • This is a courteous site. It would like to know where you come from, what currency you use, and so on, to give you a better user experience. It does this by setting a multitude of cookies on the landing page. So we follow suit: first navigate to the landing page to pick up the cookies, then go to the page we want:

    library(RCurl)
    myURL <- "http://www.bluenile.com/api/public/loose-diamond/diamond-details/panel?country=USA&currency=USD&language=en-us&productSet=BN&sku=LD04077082"
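    agent <- "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:32.0) Gecko/20100101 Firefox/32.0"
    
    # Set RCurl options; setting cookiejar turns on cookie handling for this handle
    curl <- getCurlHandle()
    curlSetOpt(cookiejar = "cookies.txt", useragent = agent, followlocation = TRUE, curl = curl)
    # Visit the landing page first so the site can set its cookies on the handle
    firstPage <- getURL("http://www.bluenile.com", curl = curl)
    # Request the API URL with the same handle so the cookies are sent along
    myPage <- getURL(myURL, curl = curl)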
    
    library(RJSONIO)
    > names(fromJSON(myPage))
    [1] "diamondDetailsHeader" "diamondDetailsBodies" "pageMetadata"         "expandedUrl"         
    [5] "newVersion"           "multiDiamond"  
    

    and the cookies:

    > getCurlInfo(curl)$cookielist
     [1] ".bluenile.com\tTRUE\t/\tFALSE\t2412270275\tGUID\tDA5C11F5_E468_46B5_B4E8_D551D4D6EA4D"                                                                    
     [2] ".bluenile.com\tTRUE\t/\tFALSE\t1475342275\tsplit\tver~3&presetFilters~TEST"                                                                               
     [3] ".bluenile.com\tTRUE\t/\tFALSE\t1727630275\tsitetrack\tver~2&jse~0"                                                                                        
     [4] ".bluenile.com\tTRUE\t/\tFALSE\t1425230275\tpop\tver~2&china~false&french~false&ie~false&internationalSelect~false&iphoneApp~false&survey~false&uae~false" 
     [5] ".bluenile.com\tTRUE\t/\tFALSE\t1475342275\tdsearch\tver~6&newUser~true"                                                                                   
     [6] ".bluenile.com\tTRUE\t/\tFALSE\t1443806275\tlocale\tver~1&country~IRL&currency~EUR&language~en-gb&productSet~BNUK"                                         
     [7] ".bluenile.com\tTRUE\t/\tFALSE\t0\tbnses\tver~1&ace~false&isbml~false&fbcs~false&ss~0&mbpop~false&sswpu~false&deo~false"                                   
     [8] ".bluenile.com\tTRUE\t/\tFALSE\t1727630275\tbnper\tver~5&NIB~0&DM~-&GUID~DA5C11F5_E468_46B5_B4E8_D551D4D6EA4D&SESS-CT~1&STC~32RPVK&FB_MINI~false&SUB~false"
     [9] "#HttpOnly_www.bluenile.com\tFALSE\t/\tFALSE\t0\tJSESSIONID\tB8475C3AEC08205E5AC6252C94E4B858"                                                             
    [10] ".bluenile.com\tTRUE\t/\tFALSE\t1727630278\tmigrationstatus\tver~1&redirected~false"
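
Each entry in the cookie list above appears to follow the tab-separated Netscape cookie-file layout that libcurl uses (domain, subdomain flag, path, secure flag, expiry, name, value), with a `#HttpOnly_` prefix on the domain for HTTP-only cookies. As a rough sketch, assuming that layout, the entries can be split into a data frame to make them easier to inspect. `parse_cookielist()` is a hypothetical helper written for this illustration, not an RCurl function, and the two sample strings are copied from the output above.

```r
# Two sample entries copied from the getCurlInfo(curl)$cookielist output above.
cookie_lines <- c(
  ".bluenile.com\tTRUE\t/\tFALSE\t1443806275\tlocale\tver~1&country~IRL&currency~EUR&language~en-gb&productSet~BNUK",
  "#HttpOnly_www.bluenile.com\tFALSE\t/\tFALSE\t0\tJSESSIONID\tB8475C3AEC08205E5AC6252C94E4B858"
)

# Hypothetical helper (not part of RCurl): split Netscape-format cookie
# lines into a data frame, one row per cookie.
parse_cookielist <- function(lines) {
  fields <- strsplit(lines, "\t", fixed = TRUE)
  # Keep only well-formed entries with all seven fields.
  fields <- fields[lengths(fields) == 7L]
  m <- do.call(rbind, fields)
  data.frame(
    domain             = sub("^#HttpOnly_", "", m[, 1]),
    http_only          = grepl("^#HttpOnly_", m[, 1]),
    include_subdomains = m[, 2] == "TRUE",
    path               = m[, 3],
    secure             = m[, 4] == "TRUE",
    expires            = as.numeric(m[, 5]),  # Unix time; 0 means session cookie
    name               = m[, 6],
    value              = m[, 7],
    stringsAsFactors   = FALSE
  )
}

cookies <- parse_cookielist(cookie_lines)
print(cookies[, c("name", "domain", "http_only", "expires")])
```

With a live handle, the same helper could be called as `parse_cookielist(getCurlInfo(curl)$cookielist)`. An `expires` value of 0 marks a session cookie, such as JSESSIONID here.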