I'm having trouble pulling data from the following website. If I go to the long_url via my browser I can see the table I want to scrape, but if I call the url from within R using httr, I'm either not getting the data returned to me, or I'm not understanding how it's being returned to me.
base_url <- "http://web1.ncaa.org/stats/exec/records"
long_url <- "http://web1.ncaa.org/stats/exec/records?academicYear=2014&sportCode=MFB&orgId=721"
library(XML)
library(httr)
library(rvest) # devtools::install_github("hadley/rvest")
The results of these two POST requests look identical to me:
doc <- POST(base_url, query = list(academicYear = "2014", sportCode = "MFB",
orgId = "721"))
doc <- POST(long_url)
class(doc)
Both POST requests return a status code of 200, and the class of doc is "HTMLInternalDocument" and "XMLInternalDocument" which is the normal R object that allows me to scrape the pages. But then the following rvest and XML functions come up empty, even though I know there is a table at the url.
table <- html_nodes(doc, css = "td")
table <- readHTMLTable(doc)
Could someone help explain to me what my httr request is missing? I also tried a GET request with no luck.
What you are encountering here is actually a pretty common problem. httr uses RCurl for the heavy lifting. The default user_agent header sent in a GET or POST request by RCurl is NULL, which frequently confuses server-side scripts. This is why you get different results from your browser and from httr. If you spoof a meaningful user agent, you get the results you want.
base_url <- "http://web1.ncaa.org/stats/exec/records"
ua <- "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:33.0) Gecko/20100101 Firefox/33.0"
library(httr)
library(XML)
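# Send the same query as before, but with a browser-like user agent
doc <- POST(base_url,
            query = list(academicYear = "2014", sportCode = "MFB", orgId = "721"),
            user_agent(ua))
# Parse the response body into an XML document that readHTMLTable() accepts
html <- content(doc, useInternalNodes = TRUE)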
df.list <- readHTMLTable(html)
df <- df.list[[4]]
head(df)
# Opponent Game Date Air ForceScore OppScore Loc Neutral SiteLocation GameLength Attend
# 1 Colgate 08/31/2013 38 13 Home - 32,095
# 2 Utah St. 09/07/2013 20 52 Home - 32,716
# 3 Boise St. 09/13/2013 20 42 Away - 36,069
# 4 Wyoming 09/21/2013 23 56 Home - 35,389
# 5 Nevada 09/28/2013 42 45 Away - 24,545
# 6 Navy 10/05/2013 10 28 Away - 38,225
Note also that this website uses tables for just about everything, so readHTMLTable(...) actually returns a list of 4 data frames. The 4th is the one you want.
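If you would rather not hard-code the index 4 (it would silently break if the page layout changed), one option is to pick the table by its column names instead. The following is only a sketch: find_table is a hypothetical helper, not part of XML or httr, and toy.list is stand-in data so the example runs offline. With the real page you would pass df.list, assuming the schedule table still has an "Opponent" column as in the output above.

```r
# Hypothetical helper: return the first data frame in `tables` whose column
# names include every name in `required`, or NULL if nothing matches.
find_table <- function(tables, required) {
  for (tbl in tables) {
    # Skip anything that is not a data frame (for example, NULL entries)
    if (is.data.frame(tbl) && all(required %in% names(tbl))) {
      return(tbl)
    }
  }
  NULL
}

# Stand-in for the list returned by readHTMLTable(); values are illustrative.
toy.list <- list(
  data.frame(V1 = "layout cell", stringsAsFactors = FALSE),
  NULL,
  data.frame(Opponent = c("Colgate", "Utah St."),
             Attend   = c("32,095", "32,716"),
             stringsAsFactors = FALSE)
)

schedule <- find_table(toy.list, c("Opponent", "Attend"))
print(schedule$Opponent)
```

With the real data, find_table(df.list, "Opponent") should give the same data frame as df.list[[4]], provided only one table on the page has that column.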
You don't need rvest.