javascript asp.net r rcurl httr

How to download a file behind a semi-broken javascript asp function with R


I am trying to fix a download automation script that I provide publicly so that anyone can easily download the World Values Survey with R.

On this web page - http://www.worldvaluessurvey.org/WVSDocumentationWV4.jsp - the PDF link "WVS_2000_Questionnaire_Root" downloads easily in Firefox and Chrome. I cannot figure out how to automate the download with httr, RCurl, or any other R package. The screenshot below shows the network behavior in Chrome. That PDF link needs to follow through to the ultimate source, http://www.worldvaluessurvey.org/wvsdc/DC00012/F00001316-WVS_2000_Questionnaire_Root.pdf, but if you click there directly, there's a connectivity error. I am unclear whether this is related to the request header Upgrade-Insecure-Requests: 1 or the response status code 302.

Clicking around the new worldvaluessurvey.org website with Chrome's Inspect Element window open makes me think some hacky coding decisions were made here, hence the "semi-broken" in the title :/

[Screenshot: Chrome network behavior for the PDF link]


Solution

  • Using the excellent curlconverter to mimic the browser, you can request the PDF directly.

    First we mimic the browser's initial GET request (this may not be necessary; a simple GET that keeps the cookie may suffice):

    library(curlconverter)
    library(httr)
    browserGET <- "curl 'http://www.worldvaluessurvey.org/WVSDocumentationWV4.jsp' -H 'Host: www.worldvaluessurvey.org' -H 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:49.0) Gecko/20100101 Firefox/49.0' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' -H 'Accept-Language: en-US,en;q=0.5' --compressed -H 'Connection: keep-alive' -H 'Upgrade-Insecure-Requests: 1'"
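    # straighten() parses the curl command; make_req() returns a list of
    # functions, so we take the first one and call it to perform the request
    getDATA <- (straighten(browserGET) %>% make_req)[[1]]()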
    

    The JSESSIONID cookie is available at getDATA$cookies$value

    getPDF <- "curl 'http://www.worldvaluessurvey.org/wvsdc/DC00012/F00001316-WVS_2000_Questionnaire_Root.pdf' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' -H 'Accept-Encoding: gzip, deflate' -H 'Accept-Language: en-US,en;q=0.5' -H 'Connection: keep-alive' -H 'Cookie: JSESSIONID=59558DE631D107B61F528C952FC6E21F' -H 'Host: www.worldvaluessurvey.org' -H 'Referer: http://www.worldvaluessurvey.org/AJDocumentationSmpl.jsp' -H 'Upgrade-Insecure-Requests: 1' -H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:49.0) Gecko/20100101 Firefox/49.0'"
    appIP <- straighten(getPDF)
    # replace the hard-coded session cookie with the one from the first request
    appIP[[1]]$cookies$JSESSIONID <- getDATA$cookies$value
    appReq <- make_req(appIP)
    response <- appReq[[1]]()
    writeBin(response$content, "test.pdf")
    

    The curl strings were plucked straight from the browser and curlconverter then does all the work.
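
As noted above, a simple GET that keeps the cookie may be enough. Here is a minimal sketch of that idea using httr alone. It is untested against the live site and assumes the server only needs the session cookie and the Referer header; `download_wvs_pdf` and `is_pdf` are hypothetical helper names, not part of httr or curlconverter. It also checks that the response really is a PDF before writing it, since a failed request may return an error page instead.

```r
# Sketch only: assumes the server accepts the PDF request once a session
# cookie from the documentation page is present. Requires the httr package.

# TRUE if a raw vector starts with the PDF magic bytes "%PDF".
is_pdf <- function(bytes) {
  length(bytes) >= 4 && identical(bytes[1:4], charToRaw("%PDF"))
}

# Hypothetical helper: fetch the page for its cookie, then fetch the PDF.
download_wvs_pdf <- function(page_url, pdf_url, referer, destfile) {
  # First request: httr reuses one handle per host, so any cookie set
  # here (e.g. JSESSIONID) is sent automatically with the next request.
  page <- httr::GET(page_url)
  httr::stop_for_status(page)

  # Second request: ask for the PDF with the Referer the browser sent.
  pdf <- httr::GET(pdf_url, httr::add_headers(Referer = referer))
  httr::stop_for_status(pdf)

  bytes <- httr::content(pdf, as = "raw")
  if (!is_pdf(bytes)) {
    stop("Response is not a PDF; the full browser headers may be required.")
  }
  writeBin(bytes, destfile)
  invisible(destfile)
}

# Usage (URLs taken from the question and the curl strings above):
# download_wvs_pdf(
#   page_url = "http://www.worldvaluessurvey.org/WVSDocumentationWV4.jsp",
#   pdf_url  = "http://www.worldvaluessurvey.org/wvsdc/DC00012/F00001316-WVS_2000_Questionnaire_Root.pdf",
#   referer  = "http://www.worldvaluessurvey.org/AJDocumentationSmpl.jsp",
#   destfile = "test.pdf"
# )
```

If this stops with the "not a PDF" error, fall back to the curlconverter approach, which replays every header the browser sent.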