Tags: html, r, post, rcurl, rvest

How can I POST a simple HTML form in R?


I'm relatively new to R programming and I'm trying to put some of what I'm learning in the Johns Hopkins Data Science track to practical use. Specifically, I would like to automate the process of downloading historical bond prices from the US Treasury website.

Using both Firefox and R, I was able to determine that the US Treasury website uses a very simple HTML POST form to specify a single date for the quotes of interest. It then returns a table of secondary market information for all outstanding bonds.

I have tried two different R packages to submit a request to the US Treasury web server, without success. Here are the two approaches I tried:

Attempt #1 (using RCurl):

library(RCurl)

url <- "https://www.treasurydirect.gov/GA-FI/FedInvest/selectSecurityPriceDate.htm"
# Submit the date-selection form fields as a POST request
td.html <- postForm(url,
                    submit = "Show Prices",
                    priceDate.year  = 2014,
                    priceDate.month = 12,
                    priceDate.day   = 15,
                    .opts = curlOptions(ssl.verifypeer = FALSE))

This results in a web page being returned and stored in td.html but all it contains is an error message from the treasurydirect server. I know the server is working because when I submit the same request via my browser, I get the expected results.

Attempt #2 (using rvest):

library(rvest)

# url is the same address used in Attempt #1
s <- html_session(url)
f0 <- html_form(s)
f1 <- set_values(f0[[2]], priceDate.year = 2014, priceDate.month = 12, priceDate.day = 15)
test <- submit_form(s, f1)

Unfortunately, this approach doesn't even leave R and results in the following error message from R:

Submitting with 'submit'
Error in function (type, msg, asError = TRUE)  : <url> malformed

I can't seem to figure out how to see what "malformed" text is being sent to rvest so that I can try to diagnose the problem.

Any suggestions or tips for solving this seemingly simple task would be greatly appreciated!


Solution

  • Well, it appears to work with the httr library.

    library(httr)
    
    url <- "https://www.treasurydirect.gov/GA-FI/FedInvest/selectSecurityPriceDate.htm"
    
    fd <- list(
        submit = "Show Prices",
        priceDate.year  = 2014,
        priceDate.month = 12,
        priceDate.day   = 15
    )
    
    resp <- POST(url, body = fd, encode = "form")
    content(resp)
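
    Since the goal is to fetch prices for many dates, the field list can be generated from a date rather than typed by hand. The following is only a sketch in base R: price_date_fields is a hypothetical helper name, and it assumes the server accepts the same four field names used in the request above.

    ```r
    # Hypothetical helper (not part of httr): build the form fields for one date.
    # The field names are copied from the request above.
    price_date_fields <- function(date) {
      d <- as.Date(date)
      list(
        submit          = "Show Prices",
        priceDate.year  = as.integer(format(d, "%Y")),
        priceDate.month = as.integer(format(d, "%m")),
        priceDate.day   = as.integer(format(d, "%d"))
      )
    }

    fd <- price_date_fields("2014-12-15")
    str(fd)

    # With httr loaded, the request itself would then be (not run here):
    # resp <- POST(url, body = fd, encode = "form")
    ```

    Looping over a vector of dates and calling the helper for each one would give one field list per request.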
    

    The rvest library is essentially a wrapper around httr. It looks like it doesn't correctly resolve host-relative URLs, that is, URLs that start with a path and omit the scheme and server name. So if you look at

    f1$url
    # [1] /GA-FI/FedInvest/selectSecurityPriceDate.htm
    

    you see that it contains only the path and not the server name. This appears to confuse httr, which needs a complete URL. If you do

    f1 <- set_values(f0[[2]], priceDate.year=2014, priceDate.month=12, priceDate.day=15)
    f1$url <- url
    test <- submit_form(s, f1)
    

    that seems to work. Perhaps it's a bug that should be reported to rvest. (Tested on rvest_0.1.0)
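
    To make the workaround less dependent on hard-coding the full address, the form's path can be combined with the scheme and host of the page it came from. This is a minimal base-R sketch of that one case: resolve_host_relative is a hypothetical helper, not an rvest or httr function, and it only handles paths that begin with "/".

    ```r
    # Hypothetical helper: prefix a host-relative path ("/...") with the
    # scheme and host of the page URL. Handles only this one case.
    resolve_host_relative <- function(path, page_url) {
      if (grepl("^[A-Za-z][A-Za-z0-9+.-]*://", path)) {
        return(path)  # already a full URL, leave it alone
      }
      stopifnot(startsWith(path, "/"))
      # Keep everything up to (but not including) the first "/" after the host
      origin <- sub("^([A-Za-z][A-Za-z0-9+.-]*://[^/]+).*$", "\\1", page_url)
      paste0(origin, path)
    }

    page_url <- "https://www.treasurydirect.gov/GA-FI/FedInvest/selectSecurityPriceDate.htm"
    action   <- "/GA-FI/FedInvest/selectSecurityPriceDate.htm"

    full_url <- resolve_host_relative(action, page_url)
    print(full_url)
    ```

    In the session above this would be used as f1$url <- resolve_host_relative(f1$url, url) before calling submit_form. For general relative URLs, xml2::url_absolute() is probably the better tool.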