I'm trying to scrape some data from websites with rvest. I have a tibble of thousands of URLs, and I need to extract one piece of data from each URL. To avoid being blocked by the main site I'm visiting, I need to rest for about 2 minutes after every 200 URLs I visit (learned this via trial and error). I'm wondering how I can use Sys.sleep to do this.
My current code is below. For each URL in url_tibble, I pull the href attribute of the elements matching the ".verified" selector.
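library(rvest)
library(tidyverse)

# Function to extract data: href of every ".verified" element on the page
get_data <- function(x) {
  read_html(x) %>%
    html_nodes(".verified") %>%
    html_attr("href")
}
# Extract data into a list column, one element per URL
data_I_need <- url_tibble %>%
  mutate(profile = map(url, ~ get_data(.x)))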
This code works for a limited number of URLs, until I get blocked for scraping the site too quickly. To avoid being blocked, I'd like to pause for 2 minutes after every 200 URLs using Sys.sleep. Can you help me figure out how to do this?
The best recommendation I found for how to do this was from the solution posted on Recommendation when using Sys.sleep() in R with rvest, but I can't figure out how to integrate the solution with my code. This solution uses loops instead of map. I tried doing something like this:
output <- vector(length = length(url_tibble$url))
for(i in 1:length(url_tibble$url)) {
  data_I_need <- read_html(url_tibble$url[i]) %>%
    html_nodes(".verified") %>%
    html_attr("href")
  output[i] <- data_I_need
  if((i %% 200) == 0){
    Sys.sleep(160)
  }
}
However, this does not work either, and I receive an error message.
We can use lapply in lieu of a loop. Also, I have prepended https:// to each URL so that read_html recognises them as links rather than files. For the actual data, replace 2 with 200, and lengthen the Sys.sleep pause (20 seconds here) to the roughly 2 minutes you need.
lapply(seq_along(url_tibble$url), function(x){
  # Pause before every 2nd URL (use 200 for the real data)
  if(x %% 2 == 0){
    print(paste0("Sleeping at ", x))
    Sys.sleep(20)
  }
  read_html(paste0("https://", url_tibble$url[x])) %>%
    html_nodes(".verified") %>%
    html_attr("href")
})
Output (truncated)
[1] "Sleeping at 2"
[1] "Sleeping at 4"
[1] "Sleeping at 6"
[[1]]
[1] "https://www.psychologytoday.com/us/therapists/aak-bright-start-rego-park-ny/936718"
[2] "https://www.psychologytoday.com/us/therapists/leslie-aaron-new-york-ny/148793"
[3] "https://www.psychologytoday.com/us/therapists/lindsay-aaron-frieman-new-york-ny/761657"
[4] "https://www.psychologytoday.com/us/therapists/fay-m-aaronson-brooklyn-ny/840861"
[5] "https://www.psychologytoday.com/us/therapists/anita-aasen-staten-island-ny/291614"
[6] "https://www.psychologytoday.com/us/therapists/aask-therapeutic-services-fishkill-ny/185423"
[7] "https://www.psychologytoday.com/us/therapists/amanda-abady-brooklyn-ny/935849"
[8] "https://www.psychologytoday.com/us/therapists/denise-abatemarco-new-york-ny/143678"
[9] "https://www.psychologytoday.com/us/therapists/raya-abat-robinson-new-york-ny/810730"
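
If you would rather keep the mutate() pipeline from the question, the same idea can be wrapped in a small helper. The following is only a minimal sketch: scrape_in_batches, fetch, batch_size and pause_secs are hypothetical names introduced for illustration, not part of rvest or purrr, and the defaults of 200 URLs and 120 seconds are taken from the numbers in the question.

```r
# Hypothetical helper (not from rvest/purrr): apply `fetch` to every URL and
# pause for `pause_secs` seconds after each block of `batch_size` URLs.
scrape_in_batches <- function(urls, fetch, batch_size = 200, pause_secs = 120) {
  n <- length(urls)
  lapply(seq_len(n), function(i) {
    result <- fetch(urls[[i]])
    # Sleep after every full batch, but skip the pause after the last URL
    if (i %% batch_size == 0 && i < n) {
      message("Sleeping after URL ", i)
      Sys.sleep(pause_secs)
    }
    result
  })
}
```

Because the helper returns a list with one element per URL, it should be able to stand in for map() in the original call, for example mutate(profile = scrape_in_batches(url, get_data)), assuming get_data is defined as in the question and the URLs already include the https:// prefix.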