How to structure httr POST request to return site data?


I'm having trouble pulling data from the following website. If I open long_url in my browser I can see the table I want to scrape, but when I request the URL from within R using httr, either the data isn't being returned to me or I don't understand how it's being returned.

base_url <- "http://web1.ncaa.org/stats/exec/records"
long_url <- "http://web1.ncaa.org/stats/exec/records?academicYear=2014&sportCode=MFB&orgId=721"

library(XML)
library(httr)
library(rvest) # devtools::install_github("hadley/rvest")

The results of these POST requests look identical to me:

doc <- POST(base_url, query = list(academicYear = "2014", sportCode = "MFB",
                                   orgId = "721"))
doc <- POST(long_url)

class(doc)

Both POST requests return a status code of 200, and the class of doc is "HTMLInternalDocument" and "XMLInternalDocument" which is the normal R object that allows me to scrape the pages. But then the following rvest and XML functions come up empty, even though I know there is a table at the url.

table <- html_nodes(doc, css = "td")
table <- readHTMLTable(doc)

Could someone help explain to me what my httr request is missing? I also tried a GET request with no luck.


Solution

  • What you are encountering here is a fairly common problem. httr uses RCurl for the heavy lifting, and the default User-Agent header that RCurl sends in a GET or POST request is NULL, which frequently confuses server-side scripts. This is why you get different results from your browser and from httr. If you spoof a meaningful user agent, you get the results you want:

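    library(httr)
    library(XML)

    base_url <- "http://web1.ncaa.org/stats/exec/records"
    # Browser-like user agent string to send with the request
    ua       <- "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:33.0) Gecko/20100101 Firefox/33.0"
    doc <- POST(base_url,
                query = list(academicYear = "2014", sportCode = "MFB", orgId = "721"),
                user_agent(ua))

    # Take the response body as text and parse it into an XML document
    # that readHTMLTable() can work with
    html    <- htmlParse(content(doc, as = "text"), asText = TRUE)
    df.list <- readHTMLTable(html)   # one list entry per <table> on the page
    df      <- df.list[[4]]          # the schedule table is the 4th
    head(df)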
    #    Opponent  Game Date Air ForceScore OppScore  Loc Neutral SiteLocation GameLength Attend
    # 1   Colgate 08/31/2013             38       13 Home                               - 32,095
    # 2  Utah St. 09/07/2013             20       52 Home                               - 32,716
    # 3 Boise St. 09/13/2013             20       42 Away                               - 36,069
    # 4   Wyoming 09/21/2013             23       56 Home                               - 35,389
    # 5    Nevada 09/28/2013             42       45 Away                               - 24,545
    # 6      Navy 10/05/2013             10       28 Away                               - 38,225
    

    Note also that this website uses tables for just about everything, so readHTMLTable(...) actually returns a list of 4 data frames. The 4th is the one you want.

    You don't need rvest.
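
Hard-coding the index 4 works for the page as it is now, but the position could shift if the site's layout changes. One option is to pick the table by its column names instead. Below is a minimal sketch of that idea. `find_table` is a hypothetical helper (it is not part of XML or httr), and the `df.list` built here is made-up stand-in data shaped like the result of readHTMLTable(...), not the real scrape. The column names "Opponent" and "OppScore" are taken from the output shown above.

```r
# Hypothetical helper: return the first data frame in `tables` whose column
# names include every name in `required`, or NULL if none matches.
find_table <- function(tables, required) {
  for (tbl in tables) {
    if (is.data.frame(tbl) && all(required %in% names(tbl))) {
      return(tbl)
    }
  }
  NULL
}

# Stand-in for readHTMLTable(html): the first entries mimic layout tables
# (including a NULL entry), the last one mimics the schedule table.
df.list <- list(
  NULL,
  data.frame(V1 = "menu", stringsAsFactors = FALSE),
  data.frame(V1 = "header", V2 = "links", stringsAsFactors = FALSE),
  data.frame(Opponent = c("Colgate", "Utah St."),
             OppScore = c("13", "52"),
             stringsAsFactors = FALSE)
)

# Select the schedule table by its columns rather than by position
df <- find_table(df.list, c("Opponent", "OppScore"))
head(df)
```

With the real scrape you would pass the list returned by readHTMLTable(html) in place of the stand-in, and check for NULL before using the result.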