I've created a loop to scrape NBA regular season data. It cycles through all the regular season months over a set of years. My code keeps returning the error "Error in open.connection(x, "rb") : HTTP error 429." even though the webpage exists online and is accessible to everyone.
I've created a "try" variable to handle months in my list when no NBA games were played. The loop should skip those months and move on to the next one. My loop seems to work fine. I do notice that my memory usage report shows upwards of 95% of memory in use while the loop runs. Could this be a potential issue I need to address in order to run the loop and build a table of NBA regular season data for my analysis period? Any help is greatly appreciated.
library(rvest)
library(dplyr)
mnths = c("october","november","december","january","february","march","april","may")
#list of months to cycle through
yrs = seq(2003,2017)
#list of years to cycle through
url_base = "https://www.basketball-reference.com/leagues/NBA_"
#beginning of webpage URL
#https://www.basketball-reference.com/leagues/NBA_2003_games-october.html
#above is the final webpage formatted for the 1st run of the loop as an example
i = 1
j = 1
while(i<=length(yrs)){
#begin loop to cycle through each year
while(j <= length(mnths)){
#begin subloop to cycle through each month in a year
webpage = paste(paste(paste(paste(url_base,yrs[i],sep = ""),"_games-",sep = ""),mnths[j],sep = ""),".html",sep = "")
#string variable of webpage with specific month and year in loop
webpageexists = try(read_html(webpage) %>% html_node(), silent = TRUE)
#try variable to check if webpage exists
if(webpageexists == "try-error"){
#if statement to check if webpage exists, if not variable "webpageexists" will be a try-error and the month will be incremented and subloop continues
j = j + 1
rm(webpageexists)
#removing try variable from memory
}else if(exists("tb")){
tbx = as.data.frame(read_html(webpage) %>% html_nodes("table") %>% html_table())
#table created to contain new data from specific webpage in loop
tb = rbind(tb,tbx)
#table holding all data from all runs of loop
j = j + 1
rm(webpageexists)
#removing try variable from memory
}else{
tb = as.data.frame(read_html(webpage) %>% html_nodes("table") %>% html_table())
#table that is created that all new tables will be merged into
#this else statement is only used on the very first run of the loop
j = j + 1
rm(webpageexists)
#removing try variable from memory
}
}
#end subloop to cycle through each month in a year
j = 1
#j reset to 1 so that the next year starts at the first month in the "mnths" list
i = i + 1
#i is incremented by 1 to move to the next year in the "yrs" list
}
#end loop to cycle through each year
HTTP error 429
Status 429 (Too Many Requests) means the server has blocked you for sending too many requests in a short period.
From https://www.sports-reference.com/bot-traffic.html :
Unfortunately, non-human traffic, ie bots, crawlers, scrapers, can overwhelm our servers with the number of requests they send us in a short amount of time. Therefore we are implementing rate limiting on the site. We will attempt to keep this page up to date with our current settings.
Currently we will block users sending requests to:
- FBref and Stathead sites more often than ten requests in a minute.
- our other sites more often than twenty requests in a minute.
- This is regardless of bot type and construction and pages accessed.
- If you violate this rule your session will be in jail for up to a day.
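To put those limits next to the loop in the question, here is a rough count of the requests it can send. It is only a sketch: the factor of two is an assumption based on the question's code calling read_html() once inside try() and once more to extract the table.

```r
# Values taken from the question's loop
yrs <- seq(2003, 2017)
mnths <- c("october", "november", "december", "january",
           "february", "march", "april", "may")

# One candidate URL per year/month combination
n_pages <- length(yrs) * length(mnths)

# Assumption: up to two read_html() calls per page
# (one inside try(), one more to extract the table)
max_requests <- 2 * n_pages

# Published limit for sites other than FBref and Stathead
limit_per_minute <- 20

# Shortest run time that stays within the limit, in minutes
min_minutes <- max_requests / limit_per_minute

c(pages = n_pages, max_requests = max_requests, min_minutes = min_minutes)
```

A loop without any pause sends these requests as fast as the server answers, which is well above twenty per minute.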
So you might want to think about limiting the request rate. And instead of guessing monthly schedule URLs, you could also collect the valid ones first.
For rate limiting you could use the {polite} package, purrr::slowly(), throttling with httr2, or just plain Sys.sleep(), for example.
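As an illustration of the plain Sys.sleep() option, a minimal sketch might look like the following. The helper name `throttle` and its `min_interval` argument are made up for this example and are not part of any package. The demonstration uses toupper() as a stand-in so that it runs without network access.

```r
# Hypothetical helper: wrap a function so that consecutive calls
# are at least `min_interval` seconds apart
throttle <- function(f, min_interval) {
  last_call <- NULL  # time of the previous call, kept in the closure
  function(...) {
    if (!is.null(last_call)) {
      elapsed <- as.numeric(difftime(Sys.time(), last_call, units = "secs"))
      # sleep only for the part of the interval that has not passed yet
      if (elapsed < min_interval) Sys.sleep(min_interval - elapsed)
    }
    last_call <<- Sys.time()
    f(...)
  }
}

# For the real scraper, 20 requests per minute is one request every 3 seconds:
# slow_read_html <- throttle(rvest::read_html, min_interval = 60 / 20)

# Demonstration with a stand-in function and a short interval
slow_toupper <- throttle(toupper, min_interval = 0.2)
started <- Sys.time()
out <- vapply(c("a", "b", "c"), slow_toupper, character(1))
waited <- as.numeric(difftime(Sys.time(), started, units = "secs"))
```

The wrapper is created once and then reused for every call, so the time of the previous call is remembered between requests. The first call goes through without a pause.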
Here I'm collecting monthly URLs with a slowed-down read_html() ( slowly(read_html)() ):
library(rvest)
library(dplyr)
library(stringr)
library(purrr)
library(httr2)
yrs <- 2003:2017
season_templ <- "https://www.basketball-reference.com/leagues/NBA_{y}_games.html"
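# create a slow version of read_html once, so the rate state is shared between calls:
# no more than 20 requests/min, i.e. a 3s delay between requests
slow_read_html <- slowly(read_html, rate = rate_delay(pause = 60/20))
schedule_urls <-
yrs |>
set_names() |>
map(\(y) str_glue(season_templ)) |>
# fetch each season page and collect the links to its monthly schedule pages
map(\(url_) slow_read_html(url_) |> html_elements("div.filter a") |> html_attr("href"))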
head(schedule_urls) |> str()
#> List of 6
#> $ 2003: chr [1:9] "/leagues/NBA_2003_games-october.html" "/leagues/NBA_2003_games-november.html" "/leagues/NBA_2003_games-december.html" "/leagues/NBA_2003_games-january.html" ...
#> $ 2004: chr [1:9] "/leagues/NBA_2004_games-october.html" "/leagues/NBA_2004_games-november.html" "/leagues/NBA_2004_games-december.html" "/leagues/NBA_2004_games-january.html" ...
#> $ 2005: chr [1:8] "/leagues/NBA_2005_games-november.html" "/leagues/NBA_2005_games-december.html" "/leagues/NBA_2005_games-january.html" "/leagues/NBA_2005_games-february.html" ...
#> $ 2006: chr [1:8] "/leagues/NBA_2006_games-november.html" "/leagues/NBA_2006_games-december.html" "/leagues/NBA_2006_games-january.html" "/leagues/NBA_2006_games-february.html" ...
#> $ 2007: chr [1:9] "/leagues/NBA_2007_games-october.html" "/leagues/NBA_2007_games-november.html" "/leagues/NBA_2007_games-december.html" "/leagues/NBA_2007_games-january.html" ...
#> $ 2008: chr [1:9] "/leagues/NBA_2008_games-october.html" "/leagues/NBA_2008_games-november.html" "/leagues/NBA_2008_games-december.html" "/leagues/NBA_2008_games-january.html" ...
And then fetching all tables in an httr2-based pipeline where the request rate is controlled by req_throttle():
# create base request, set rate to 20 requests per minute to
# comply with https://www.sports-reference.com/bot-traffic.html
req_base <-
request("https://www.basketball-reference.com/") |>
req_throttle(20/60)
results <-
schedule_urls |>
unlist() |>
# prepare list of requests
map(req_url_path, req = req_base) |>
# perform all prepared requests sequentially
req_perform_sequential() |>
# filter successes
resps_successes() |>
# extract tables from all monthly schedule pages, bind to a single tibble
resps_data(\(resp) resp_body_html(resp) |> html_element("table") |> html_table(convert = FALSE))
Resulting frame:
results
#> # A tibble: 19,402 × 11
#> Date `Start (ET)` `Visitor/Neutral` PTS `Home/Neutral` PTS `` ``
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Tue, O… 7:30p Philadelphia 76e… 88 Orlando Magic 95 Box … ""
#> 2 Tue, O… 10:00p Cleveland Cavali… 67 Sacramento Ki… 94 Box … ""
#> 3 Tue, O… 10:30p San Antonio Spurs 87 Los Angeles L… 82 Box … ""
#> 4 Wed, O… 7:00p Chicago Bulls 99 Boston Celtics 96 Box … ""
#> 5 Wed, O… 7:00p Washington Wizar… 68 Toronto Rapto… 74 Box … ""
#> 6 Wed, O… 7:00p Milwaukee Bucks 93 Philadelphia … 95 Box … ""
#> 7 Wed, O… 7:00p Houston Rockets 82 Indiana Pacers 91 Box … ""
#> 8 Wed, O… 7:30p Atlanta Hawks 94 New Jersey Ne… 105 Box … ""
#> 9 Wed, O… 7:30p Orlando Magic 100 Miami Heat 86 Box … ""
#> 10 Wed, O… 8:00p Denver Nuggets 77 Minnesota Tim… 83 Box … ""
#> # ℹ 19,392 more rows
#> # ℹ 3 more variables: Attend. <chr>, Arena <chr>, Notes <chr>