Using Sys.sleep in rvest


I'm trying to scrape some data from websites with rvest. I have a tibble of thousands of URLs, and I need to extract one piece of data from each URL. To avoid being blocked by the main site I'm visiting, I need to pause for about 2 minutes after every 200 URLs I visit (I learned this through trial and error). I'm wondering how I can use Sys.sleep to do this.

My current code is below. I am going to each URL in url_tibble and pulling data (".verified").

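library(rvest)  # read_html(), html_nodes(), html_attr()
library(dplyr)  # mutate(), %>%
library(purrr)  # map()

# Function to extract data
get_data <- function(x) {
  read_html(x) %>%
    html_nodes(".verified") %>%
    html_attr("href")
}

# Extract data
data_I_need <- url_tibble %>%
  mutate(profile = map(url, ~ get_data(.x)))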

This code works for a limited number of URLs, until I get blocked for scraping the site too quickly. To avoid being blocked, I'd like to pause for 2 minutes after every 200 URLs using Sys.sleep. Can you help me figure out how to do this?

The best recommendation I found for how to do this was from the solution posted on Recommendation when using Sys.sleep() in R with rvest, but I can't figure out how to integrate the solution with my code. This solution uses loops instead of map. I tried doing something like this:

output <- vector(length = length(url_tibble$url))

for(i in 1:length(url_tibble$url)) {
  data_I_need <- read_html(url_tibble$url[i]) %>%
    html_nodes(".verified") %>%
    html_attr("href")
  output[i] <- data_I_need
  if((i %% 200) == 0){
    Sys.sleep(160)
  }
}

However, this does not work either, and I receive an error message.


Solution

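  • We can use lapply in lieu of a loop. I have also prepended https:// to each URL so that read_html treats them as web links rather than file paths. The example pauses for 20 seconds before every 2nd URL to keep the demonstration short; for the actual data, replace 2 with 200 and 20 with 120.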

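    lapply(seq_along(url_tibble$url), function(x) {
      # Pause before every 2nd URL (use 200 for the actual data)
      if (x %% 2 == 0) {
        print(paste0("Sleeping at ", x))
        Sys.sleep(20)
      }
      # Prepend the scheme so read_html treats the string as a web address
      read_html(paste0("https://", url_tibble$url[x])) %>%
        html_nodes(".verified") %>%
        html_attr("href")
    })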
    

    Output (truncated)

    [1] "Sleeping at 2"
    [1] "Sleeping at 4"
    [1] "Sleeping at 6"
    [[1]]
     [1] "https://www.psychologytoday.com/us/therapists/aak-bright-start-rego-park-ny/936718"                   
     [2] "https://www.psychologytoday.com/us/therapists/leslie-aaron-new-york-ny/148793"                        
     [3] "https://www.psychologytoday.com/us/therapists/lindsay-aaron-frieman-new-york-ny/761657"               
     [4] "https://www.psychologytoday.com/us/therapists/fay-m-aaronson-brooklyn-ny/840861"                      
     [5] "https://www.psychologytoday.com/us/therapists/anita-aasen-staten-island-ny/291614"                    
     [6] "https://www.psychologytoday.com/us/therapists/aask-therapeutic-services-fishkill-ny/185423"           
     [7] "https://www.psychologytoday.com/us/therapists/amanda-abady-brooklyn-ny/935849"                        
     [8] "https://www.psychologytoday.com/us/therapists/denise-abatemarco-new-york-ny/143678"                   
     [9] "https://www.psychologytoday.com/us/therapists/raya-abat-robinson-new-york-ny/810730"
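
If you would rather keep the tibble pipeline from the question, one option is to wrap the pausing logic in a small helper that returns a list, so it can fill a list-column inside mutate. The sketch below is only an illustration: scrape_with_pauses and its arguments (batch_size, pause_secs, sleep_fn) are hypothetical names, not part of rvest or purrr, while get_data and url_tibble are assumed to be the objects from the question. Returning a list also sidesteps a likely cause of the loop error (the error message was not quoted, so this is a guess): each page can yield several links, which cannot be assigned to a single element of an atomic vector with output[i].

```r
# Hypothetical helper: apply `extract` to each URL and pause after every
# `batch_size` requests. It returns a list, so each element can hold any
# number of links (including none).
scrape_with_pauses <- function(urls, extract, batch_size = 200,
                               pause_secs = 120, sleep_fn = Sys.sleep) {
  lapply(seq_along(urls), function(i) {
    result <- extract(urls[[i]])
    # Pause after every full batch, but not after the final URL.
    if (i %% batch_size == 0 && i < length(urls)) {
      message("Pausing after URL ", i)
      sleep_fn(pause_secs)
    }
    result
  })
}

# Usage with the objects from the question (needs rvest and dplyr loaded,
# and performs real requests, so it is left commented out here):
# data_I_need <- url_tibble %>%
#   mutate(profile = scrape_with_pauses(url, get_data))
```

Passing the sleep function as an argument is a design choice for convenience: it lets you try the batching logic on a handful of URLs with a no-op sleep before committing to the full run.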