I'm working on a project that involves collecting data from https://www.hockey-reference.com/boxscores/. I'm trying to get every table of a season. I've generated a list of URLs by combining https://www.hockey-reference.com/boxscores/ with each date of the calendar and each team name, e.g. "https://www.hockey-reference.com/boxscores/20171005WSH.html".
I've stored every URL in a list, but some of them lead to a 404 error. I'm trying to use the RCurl package's url.exists function to detect which URLs would return a 404 and remove them from the list. The problem is that every URL in the list (including URLs that really exist) returns FALSE with url.exists inside a for loop. I've also tried calling url.exists(url_list[i]) directly in the console, but it still returns FALSE.
Here's my code:
library(rvest)
library(RCurl)
##### Variables ####
team_names = c("ANA","ARI","BOS","BUF","CAR","CGY","CHI","CBJ","COL","DAL","DET","EDM","FLA","LAK","MIN","MTL","NSH","NJD","NYI","NYR","OTT","PHI","PHX","PIT","SJS","STL","TBL","TOR","VAN","VGK","WPG","WSH")
S2017 = read.table(file = "2018_season", header = TRUE, sep = ",")
dates = as.character(S2017[,1])
#### formatting the dates ####
for (i in 1:length(dates)) {
  dates[i] = gsub("-", "", dates[i])
}
dates = unique(dates)
##### generating the urls ####
url_list = c()
for (j in 1:2) { # dates
  for (k in 1:length(team_names)) {
    print(k)
    url_site = paste("https://www.hockey-reference.com/boxscores/", dates[j], team_names[k], ".html", sep = "")
    url_list = rbind(url_site, url_list)
  }
}
url_list_raffined = c()
for (l in 1:40) {
  print(l)
  if (url.exists(url_list[l], .header = TRUE) == TRUE) {
    url_list_raffined = c(url_list_raffined, url_list[l])
  }
}
Any idea what's going wrong?
Thanks!
Instead of RCurl, you could use the httr package:
library(httr)
library(rvest)
library(xml2)
resp <- httr::GET(url_address, httr::timeout(60))
if (resp$status_code == 200) {
  html <- xml2::read_html(resp)
  txt <- rvest::html_text(rvest::html_nodes(html, "table")) # or rvest::html_table(), etc.
  # save the results somewhere or do your operations..
}
Here, url_address is the address you are trying to download. You may need to wrap this in a function or a loop to iterate over all your addresses.
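Putting that together with the URL generation from the question, a minimal sketch of such a loop could look like the following (get_boxscore_tables is just a hypothetical helper name, url_list is the vector of URLs built in the question, and the Sys.sleep(1) pause is only there to avoid hammering the server):
library(httr)
library(rvest)
library(xml2)
# Hypothetical helper: fetch one box-score page and return its tables,
# or NULL when the page does not exist (404) or another error occurs.
get_boxscore_tables <- function(url_address) {
  resp <- httr::GET(url_address, httr::timeout(60))
  if (httr::status_code(resp) != 200) {
    return(NULL) # skip 404s and other non-success responses
  }
  html <- xml2::read_html(resp)
  rvest::html_table(rvest::html_nodes(html, "table")) # list of data frames, one per table
}
# Iterate over the generated URLs, keeping only those that resolve.
all_tables <- list()
url_list_raffined <- c()
for (l in seq_along(url_list)) {
  tables <- get_boxscore_tables(url_list[l])
  if (!is.null(tables)) {
    url_list_raffined <- c(url_list_raffined, url_list[l])
    all_tables[[url_list[l]]] <- tables
  }
  Sys.sleep(1) # small pause between requests
}
Checking httr::status_code() directly takes the place of url.exists(), so the 404 pages are simply skipped rather than filtered out in a separate pass.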