r, web-scraping, memory-management, while-loop, rvest

Web-scraping in R using while loops: Error in open.connection(x, "rb") : HTTP error 429 when the webpage exists


I've created a loop to scrape NBA regular-season data. The loop cycles through all the regular-season months over a set of years. My code keeps returning the error "Error in open.connection(x, "rb") : HTTP error 429", even though the webpage exists online and is accessible to everyone.

I've created a "try" variable to handle the exceptions when an NBA season was not in play in the months in my list. The loop should move past those months that games were not played and move on to the next one. My loop seems to work fine. I do notice that my memory usage report shows upwards of 95% of memory is in use when executing my loop. Could this be potential issue I need to address to execute my loop and create a table of NBA regular season data for my analysis period? Any help is greatly appreciated.

library(rvest)
library(dplyr)
mnths = c("october","november","december","january","february","march","april","may")
#list of months to cycle through
yrs = seq(2003,2017)
#list of years to cycle through
url_base = "https://www.basketball-reference.com/leagues/NBA_"
#beginning of webpage URL
#https://www.basketball-reference.com/leagues/NBA_2003_games-october.html
#above is the final webpage formatted for the 1st run of the loop as an example
i = 1
j = 1

while(i<=length(yrs)){
#begin loop to cycle through each year

  while(j <= length(mnths)){
  #begin subloop to cycle through each month in a year

    webpage = paste0(url_base, yrs[i], "_games-", mnths[j], ".html")
    #string variable of webpage with specific month and year in loop
    webpageexists = try(read_html(webpage), silent = TRUE)
    #try variable to check if webpage exists

    if(inherits(webpageexists, "try-error")){
    #if the webpage could not be read, "webpageexists" is a try-error: increment the month and continue the subloop
      j = j + 1
      rm(webpageexists)
      #removing try variable from memory
    }else if(exists("tb")){
      tbx = as.data.frame(webpageexists %>% html_nodes("table") %>% html_table())
      #table containing the new data from the specific webpage in the loop (re-using the already fetched page avoids a second request)
      tb = rbind(tb,tbx)
      #table holding all data from all runs of loop
      j = j + 1
      rm(webpageexists)
      #removing try variable from memory
    }else{
      tb = as.data.frame(webpageexists %>% html_nodes("table") %>% html_table())
      #table that all new tables will be merged into
      #this else statement is only used on the very first run of the loop
      j = j + 1
      rm(webpageexists)
      #removing try variable from memory
    }
    
  }
  #end subloop to cycle through each month in a year

  j = 1
  #j reset to 1 so that the next year starts at the first month in the "mnths" list
  i = i + 1
  #i is incremented by 1 to move to the next year in the "yrs" list
}
#end loop to cycle through each year


Solution

  • HTTP error 429

    HTTP status 429 ("Too Many Requests") means you have been blocked for sending too many requests in a short period of time.

    From https://www.sports-reference.com/bot-traffic.html:

    Unfortunately, non-human traffic, ie bots, crawlers, scrapers, can overwhelm our servers with the number of requests they send us in a short amount of time. Therefore we are implementing rate limiting on the site. We will attempt to keep this page up to date with our current settings.

    Currently we will block users sending requests to:

    • FBref and Stathead sites more often than ten requests in a minute.
    • our other sites more often than twenty requests in a minute.
    • This is regardless of bot type and construction and pages accessed.
    • If you violate this rule your session will be in jail for up to a day.

    So you might want to limit your request rate. And instead of guessing monthly schedule URLs, you could also collect the valid ones first.

    For rate limiting you could use the {polite} package, purrr::slowly(), throttling with httr2, or just plain Sys.sleep(), for example; the simplest of these is sketched below.
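
    As a minimal sketch of the simplest option, the original inner loop could simply pause before each request; the 3-second pause below is 60 seconds divided by the 20-requests-per-minute limit quoted above:

    # inside the inner while loop, before requesting the page:
    Sys.sleep(60 / 20)
    # 3 s between requests keeps the loop under 20 requests per minute
    webpageexists = try(read_html(webpage), silent = TRUE)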

    Here I'm collecting monthly URLs with a slowed-down version of read_html() created with slowly():

    library(rvest)
    library(dplyr)
    library(stringr)
    library(purrr)
    library(httr2)
    
    yrs <- 2003:2017
    season_templ <- "https://www.basketball-reference.com/leagues/NBA_{y}_games.html"
    
    # create the slow version of read_html once, so its rate state persists
    # across calls: no more than 20 requests/min, i.e. a 3 s delay between requests
    read_html_slowly <- slowly(read_html, rate = rate_delay(pause = 60/20))
    
    schedule_urls <- 
      yrs |>
      set_names() |>
      # build the season schedule URL for each year
      map(\(y) str_glue(season_templ)) |>
      # fetch each season page slowly, extract the monthly schedule links
      map(\(url_) read_html_slowly(url_) |> html_elements("div.filter a") |> html_attr("href"))
    
    head(schedule_urls) |> str()
    #> List of 6
    #>  $ 2003: chr [1:9] "/leagues/NBA_2003_games-october.html" "/leagues/NBA_2003_games-november.html" "/leagues/NBA_2003_games-december.html" "/leagues/NBA_2003_games-january.html" ...
    #>  $ 2004: chr [1:9] "/leagues/NBA_2004_games-october.html" "/leagues/NBA_2004_games-november.html" "/leagues/NBA_2004_games-december.html" "/leagues/NBA_2004_games-january.html" ...
    #>  $ 2005: chr [1:8] "/leagues/NBA_2005_games-november.html" "/leagues/NBA_2005_games-december.html" "/leagues/NBA_2005_games-january.html" "/leagues/NBA_2005_games-february.html" ...
    #>  $ 2006: chr [1:8] "/leagues/NBA_2006_games-november.html" "/leagues/NBA_2006_games-december.html" "/leagues/NBA_2006_games-january.html" "/leagues/NBA_2006_games-february.html" ...
    #>  $ 2007: chr [1:9] "/leagues/NBA_2007_games-october.html" "/leagues/NBA_2007_games-november.html" "/leagues/NBA_2007_games-december.html" "/leagues/NBA_2007_games-january.html" ...
    #>  $ 2008: chr [1:9] "/leagues/NBA_2008_games-october.html" "/leagues/NBA_2008_games-november.html" "/leagues/NBA_2008_games-december.html" "/leagues/NBA_2008_games-january.html" ...
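
    Note that this also removes the need for the try() handling in the original loop: months in which no games were played (e.g. the 2005 season above has no October page) simply have no link, so no invalid URL is ever requested.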
    

    And then fetching all tables in an httr2-based pipeline where the request rate is controlled by req_throttle():

    # create base request, set rate to 20 requests per minute to 
    # comply with https://www.sports-reference.com/bot-traffic.html
    req_base <- 
      request("https://www.basketball-reference.com/") |> 
      req_throttle(20/60)
    
    results <- 
      schedule_urls |> 
      unlist() |>
      # prepare list of requests
      map(req_url_path, req = req_base) |>
      # perform all prepared requests sequentially
      req_perform_sequential() |>
      # filter successes
      resps_successes() |>
      # extract tables from all monthly schedule pages, bind to a single tibble
      resps_data(\(resp) resp_body_html(resp) |> html_element("table") |> html_table(convert = FALSE))
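
    If a request still fails with a 429 despite throttling, httr2 can also retry it: by default req_retry() treats 429 responses as transient and waits as instructed by the Retry-After header. A sketch of how req_base above could be extended (max_tries = 3 is just an illustrative value):

    req_base <- 
      request("https://www.basketball-reference.com/") |> 
      req_throttle(20/60) |>
      # retry transient failures such as 429, honouring Retry-After
      req_retry(max_tries = 3)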
    

    Resulting frame:

    results
    #> # A tibble: 19,402 × 11
    #>    Date    `Start (ET)` `Visitor/Neutral` PTS   `Home/Neutral` PTS   ``    ``   
    #>    <chr>   <chr>        <chr>             <chr> <chr>          <chr> <chr> <chr>
    #>  1 Tue, O… 7:30p        Philadelphia 76e… 88    Orlando Magic  95    Box … ""   
    #>  2 Tue, O… 10:00p       Cleveland Cavali… 67    Sacramento Ki… 94    Box … ""   
    #>  3 Tue, O… 10:30p       San Antonio Spurs 87    Los Angeles L… 82    Box … ""   
    #>  4 Wed, O… 7:00p        Chicago Bulls     99    Boston Celtics 96    Box … ""   
    #>  5 Wed, O… 7:00p        Washington Wizar… 68    Toronto Rapto… 74    Box … ""   
    #>  6 Wed, O… 7:00p        Milwaukee Bucks   93    Philadelphia … 95    Box … ""   
    #>  7 Wed, O… 7:00p        Houston Rockets   82    Indiana Pacers 91    Box … ""   
    #>  8 Wed, O… 7:30p        Atlanta Hawks     94    New Jersey Ne… 105   Box … ""   
    #>  9 Wed, O… 7:30p        Orlando Magic     100   Miami Heat     86    Box … ""   
    #> 10 Wed, O… 8:00p        Denver Nuggets    77    Minnesota Tim… 83    Box … ""   
    #> # ℹ 19,392 more rows
    #> # ℹ 3 more variables: Attend. <chr>, Arena <chr>, Notes <chr>
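
  • Memory usage

    Regarding the 95% memory use: growing tb with rbind() inside a loop copies the whole accumulated data frame on every iteration, which gets slower and more memory-hungry as the table grows. A common pattern (a sketch, not specific to this scraper) is to collect each month's table in a list and bind once at the end:

    tbs <- list()
    # inside the loop, instead of tb = rbind(tb, tbx):
    tbs[[webpage]] <- tbx
    # after the loop, a single bind instead of one per iteration:
    tb <- dplyr::bind_rows(tbs)

    The httr2 pipeline above sidesteps this as well, since resps_data() combines the per-page tables for you in one step.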