
R - form web scraping with rvest


First, I'd like to take a moment to thank the SO community. You have helped me many times in the past without me even needing to create an account.

My current problem involves web scraping with R, which is not my strong point.

I would like to scrape http://www.cbs.dtu.dk/services/SignalP/

What I have tried:

    library(rvest)
    url <- "http://www.cbs.dtu.dk/services/SignalP/"
    seq <- "MTSKTCLVFFFSSLILTNFALAQDRAPHGLAYETPVAFSPSAFDFFHTQPENPDPTFNPCSESGCSPLPVAAKVQGASAKAQESDIVSISTGTRSGIEEHGVVGIIFGLAFAVMM"

    session <- rvest::html_session(url)
    form <- rvest::html_form(session)[[2]]
    form <- rvest::set_values(form, `SEQPASTE` = seq)
    form_res_cbs <- rvest::submit_form(session, form)
    #rvest prints out:
    Submitting with 'trunc'

rvest::html_text(rvest::html_nodes(form_res_cbs, "head")) 
#output:
"Configuration error"

rvest::html_text(rvest::html_nodes(form_res_cbs, "body"))

#output:
"Exception:WebfaceConfigErrorPackage:Webface::service : 358Message:Unhandled #parameter 'NULL' in form "

I am unsure what the unhandled parameter is. Is the problem in the submit button? I cannot seem to force:

form_res_cbs <- rvest::submit_form(session, form, submit = "submit")
#rvest prints out
Error: Unknown submission name 'submit'.
Possible values: trunc

Is the problem that the submit button's name is NULL?

form[["fields"]][[23]] 

I tried defining the fake submit button as suggested here: Submit form with no submit button in rvest, but with no luck.
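For reference, the fake-submit workaround from that answer looks roughly like this. This is only a sketch against the old (pre-1.0) rvest API, and the structure of fake_submit is my assumption about how rvest represents form input fields internally:

    library(rvest)

    session <- rvest::html_session("http://www.cbs.dtu.dk/services/SignalP/")
    form <- rvest::html_form(session)[[2]]

    # hypothetical fake submit button, mimicking rvest's internal
    # representation of an <input type="submit"> field (an assumption)
    fake_submit <- structure(
      list(name = "submit", type = "submit", value = "submit",
           checked = NULL, disabled = NULL, readonly = NULL,
           required = FALSE),
      class = "input"
    )
    form[["fields"]][["submit"]] <- fake_submit
    form_res_cbs <- rvest::submit_form(session, form)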

I am open to solutions using rvest or RCurl/httr; I would like to avoid using RSelenium.

EDIT: Thanks to hrbrmstr's awesome answer, I was able to build a function for this task. It is available in the ragp package: https://github.com/missuse/ragp


Solution

  • Well, this is doable. But it's going to require elbow grease.

    This part:

    library(rvest)
    library(httr)
    library(tidyverse)
    
    POST(
      url = "http://www.cbs.dtu.dk/cgi-bin/webface2.fcgi",
      encode = "form",
      body=list(
        `configfile` = "/usr/opt/www/pub/CBS/services/SignalP-4.1/SignalP.cf",
        `SEQPASTE` = "MTSKTCLVFFFSSLILTNFALAQDRAPHGLAYETPVAFSPSAFDFFHTQPENPDPTFNPCSESGCSPLPVAAKVQGASAKAQESDIVSISTGTRSGIEEHGVVGIIFGLAFAVMM",
        `orgtype` = "euk",
        `Dcut-type` = "default",
        `Dcut-noTM` = "0.45",
        `Dcut-TM` = "0.50",
        `graphmode` = "png",
        `format` = "summary",
        `minlen` = "",
        `method` = "best",
        `trunc` = ""
      ),
      verbose()
    ) -> res
    

    Makes the request you made. I left verbose() in so you can watch what happens. It's missing the "filename" field, but since you pasted the sequence in as a string, it's a good mimic of what you did.

    Now, the tricky part is that the service uses an intermediary redirect page that gives you a chance to enter an e-mail address to be notified when the query is done. That page also checks regularly (every ~10 s) whether the query has finished and redirects quickly once it has.

    That page has the query id which can be extracted via:

    content(res, as="parsed") %>% 
      html_nodes("input[name='jobid']") %>% 
      html_attr("value") -> jobid
    

    Now, we can mimic the final request, but I'd add in a Sys.sleep(20) before doing so to ensure the report is done.
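    As an alternative to a fixed Sys.sleep(), you can poll until the wait page disappears. This is just a sketch: the assumption (based on the jobid extraction above) is that the wait page still contains an input named 'jobid', while the finished report does not.

    library(httr)
    library(rvest)

    repeat {
      res2 <- GET(
        url = "http://www.cbs.dtu.dk/cgi-bin/webface2.fcgi",
        query = list(jobid = jobid, wait = "20")
      )
      pg <- content(res2, as = "parsed")
      # assumption: the wait page keeps the jobid input; the report drops it
      if (length(html_nodes(pg, "input[name='jobid']")) == 0) break
      Sys.sleep(10) # retry every ~10 s, matching the page's own refresh cycle
    }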

    GET(
      url = "http://www.cbs.dtu.dk/cgi-bin/webface2.fcgi",
      query = list(
        jobid = jobid,
        wait = "20"
      ),
      verbose()
    ) -> res2
    

    That grabs the final results page:

    library(htmltools) # HTML() and html_print() come from htmltools
    html_print(HTML(content(res2, as="text")))
    

    (screenshot of the rendered results page)

    You can see the images are missing because GET only retrieves the HTML content. You can use functions from rvest/xml2 to parse through the page and scrape out the tables and the URLs, which you can then use to fetch the remaining content.
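    For example, a sketch of that post-processing (the selectors are assumptions about the report's markup, so adjust them to the actual page):

    library(rvest)
    library(xml2)

    pg <- content(res2, as = "parsed")

    # pull any HTML tables from the report into data frames
    tables <- pg %>%
      html_nodes("table") %>%
      html_table(fill = TRUE)

    # collect the image URLs so the plots can be downloaded separately
    img_urls <- pg %>%
      html_nodes("img") %>%
      html_attr("src") %>%
      url_absolute(base = "http://www.cbs.dtu.dk/")

    # e.g. download the first plot, if any:
    # download.file(img_urls[1], "signalp_plot.png", mode = "wb")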

    To work all this out, I used Burp Suite to intercept a browser session and then my burrp R package to inspect the results. You can also inspect the requests visually in Burp Suite and build things more manually.