Tags: r, web-scraping, rcurl, httr

Passing correct params to RCurl/postForm


I'm trying to download a PDF from the National Information Center via RCurl, but I've been having some trouble. For this example URL, I want the PDF corresponding to the default settings, except for "Report Format", which should be "PDF". When I run the following script, it saves the file associated with selecting the other buttons ("Parent(s) of..."/HMDA), not the defaults. I tried adding these input elements to params, but it didn't change anything. Could somebody help me identify the problem? Thanks.

    library(RCurl)
    curl = getCurlHandle()
    curlSetOpt(cookiejar = 'cookies.txt', curl = curl)
    params = list(rbRptFormatPDF = 'rbRptFormatPDF')

    url = 'https://www.ffiec.gov/nicpubweb/nicweb/OrgHierarchySearchForm.aspx?parID_RSSD=2162966&parDT_END=99991231'
    html = getURL(url, curl = curl)
    viewstate = sub('.*id="__VIEWSTATE" value="([0-9a-zA-Z+/=]*).*', '\\1', html)
    event = sub('.*id="__EVENTVALIDATION" value="([0-9a-zA-Z+/=]*).*', '\\1', html)
    params[['__VIEWSTATE']] = viewstate
    params[['__EVENTVALIDATION']] = event
    params[['btnSubmit']] = 'Submit'
    result = postForm(url, .params = params, curl = curl, style = 'POST')

    writeBin(as.vector(result), 'test.pdf')

Solution

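  • Does this provide the correct PDF? It posts all of the form's hidden fields along with the visible ones, and it sends the report format under the radio group's name (grpRptFormat = "rbRptFormatPDF") rather than as a field named rbRptFormatPDF.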

    library(httr)
    library(rvest)
    
    # fetch the form page to pick up the ASP.NET hidden state fields (__VIEWSTATE etc.)
    res <- GET(url = "https://www.ffiec.gov/nicpubweb/nicweb/OrgHierarchySearchForm.aspx",
               query=list(parID_RSSD=2162966, parDT_END=99991231))
    
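    # extract the hidden inputs as a named list (field name -> value);
    # html_attr() comes from rvest, so no separate library(xml2) is needed
    pg <- content(res, as="parsed")
    hidden <- html_nodes(pg, xpath=".//form/input[@type='hidden']")
    params <- setNames(as.list(html_attr(hidden, "value")), html_attr(hidden, "name"))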
    
    # pile on more params
    params <- c(
      params, 
      grpInstitution = "rbCurInst", 
      lbTopHolders = "2961897", 
      grpHMDA = "rbNonHMDA", 
      lbTypeOfInstitution = "-99", 
      txtAsOfDate = "12/28/2016", 
      txtAsOfDateErrMsg = "", 
      lbHMDAYear = "2015", 
      grpRptFormat = "rbRptFormatPDF", 
      btnSubmit = "Submit"
    )
    
    # submit the request and save the response body to disk
    res2 <- POST(url = "https://www.ffiec.gov/nicpubweb/nicweb/OrgHierarchySearchForm.aspx",
                 query = list(parID_RSSD=2162966, parDT_END=99991231),
                 add_headers(Origin = "https://www.ffiec.gov"),
                 body = params,
                 encode = "form",
                 write_disk("/tmp/output.pdf"))
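
Since the open question is whether the saved file really is a PDF, a quick sanity check is to look at the first bytes of the file: PDF files begin with the signature "%PDF", whereas an error page or the re-rendered form would begin with HTML. The following is only a minimal sketch using base R. The helper name looks_like_pdf is made up for illustration and is not part of httr or RCurl, and the two temporary files stand in for a real download so the example runs offline.

```r
# Hypothetical helper (not part of httr/RCurl): TRUE if the file at `path`
# starts with the PDF signature "%PDF", FALSE otherwise.
looks_like_pdf <- function(path) {
  if (!file.exists(path)) return(FALSE)
  header <- readBin(path, what = "raw", n = 4L)
  length(header) == 4L && identical(header, charToRaw("%PDF"))
}

# Offline demonstration with two throwaway files standing in for downloads.
pdf_file  <- tempfile(fileext = ".pdf")
html_file <- tempfile(fileext = ".pdf")
writeBin(charToRaw("%PDF-1.4\n"), pdf_file)                      # PDF-like header
writeBin(charToRaw("<!DOCTYPE html><html></html>"), html_file)   # HTML saved as .pdf

is_pdf      <- looks_like_pdf(pdf_file)
is_html_pdf <- looks_like_pdf(html_file)
```

After the POST above, calling looks_like_pdf("/tmp/output.pdf") would show whether the server returned a PDF at all. It cannot tell you whether the report was generated with the intended options, so opening the file is still necessary.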