Using Sys.sleep in rvest

I'm trying to scrape some data from websites with rvest. I have a tibble of thousands of URLs, and I need to extract one piece of data from each URL. To avoid being blocked by the main site I'm visiting, I need to pause for about 2 minutes after every 200 URLs I visit (I learned this through trial and error). I'm wondering how I can use Sys.sleep to do this.

My current code is below. It visits each URL in url_tibble and pulls the data matching the CSS selector ".verified".

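library(rvest)
library(dplyr)
library(purrr)

# Function to extract data
get_data <- function(x) {
  read_html(x) %>%
    html_nodes(".verified") %>%
    html_text()  # assumed final step: extract the node text
}

# Extract data
data_I_need <- url_tibble %>%
  mutate(profile = map(url, ~ get_data(.x)))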

This code works for a limited number of URLs, until I get blocked for scraping the site too quickly. To avoid being blocked, I'd like to pause for 2 minutes after every 200 URLs using Sys.sleep. Can you help me figure out how to do this?

The best recommendation I found was the solution posted on "Recommendation when using Sys.sleep() in R with rvest", but I can't figure out how to integrate it with my code, because that solution uses loops instead of map. I tried doing something like this:

output <- vector(length = length(url_tibble$url))
for(i in 1:length(url_tibble$url)) {
  data_I_need <- read_html(url_tibble$url[i]) %>%
    html_nodes(".verified") %>%
    html_text()  # assumed final step: extract the node text
  output[i] <- data_I_need
  if((i %% 200) == 0){
    Sys.sleep(120)  # pause for 2 minutes
  }
}

However, this does not work either, and I receive an error message.


  • We can use lapply in lieu of a loop. I have also prepended https:// to each URL so that read_html recognises them as links rather than files. For the actual data, replace 2 with 200.

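    lapply(seq_along(url_tibble$url), function(x) {
      # Pause for 2 minutes on every 2nd URL (use 200 for the real data)
      if (x %% 2 == 0) {
        print(paste0("Sleeping at ", x))
        Sys.sleep(120)
      }
      read_html(paste0("https://", url_tibble$url[x])) %>%
        html_nodes(".verified") %>%
        html_text()  # assumed final step: extract the node text
    })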

    Output (truncated)

    [1] "Sleeping at 2"
    [1] "Sleeping at 4"
    [1] "Sleeping at 6"
    [1] ""
    [2] ""
    [3] ""
    [4] ""
    [5] ""
    [6] ""
    [7] ""
    [8] ""
    [9] ""
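
    The same idea can be wrapped in a small helper so that it fits into the map/mutate pipeline from the question. This is only a minimal sketch in base R: the names scrape_with_pauses and add_scheme are hypothetical (they are not part of rvest or purrr), and the defaults of 200 URLs and 120 seconds simply mirror the numbers given in the question.

```r
# Hypothetical helper: apply `f` to every element of `urls`, pausing for
# `pause_secs` seconds after each batch of `every` elements.
scrape_with_pauses <- function(urls, f, every = 200, pause_secs = 120) {
  lapply(seq_along(urls), function(i) {
    result <- f(urls[[i]])
    # Pause after each full batch, but not after the very last URL
    if (i %% every == 0 && i < length(urls)) {
      message("Pausing for ", pause_secs, " seconds after ", i, " URLs")
      Sys.sleep(pause_secs)
    }
    result
  })
}

# Hypothetical helper: prepend "https://" only when a URL has no scheme yet
add_scheme <- function(urls) {
  ifelse(grepl("^https?://", urls), urls, paste0("https://", urls))
}
```

    Because the helper returns a list with one element per URL, it should be usable as a list column, for example mutate(profile = scrape_with_pauses(add_scheme(url), get_data)), assuming get_data and url_tibble are defined as in the question.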