I'm doing some scraping, but as I'm parsing approximately 4000 URLs, the website eventually detects my IP and blocks me every 20 iterations.
I've added a bunch of Sys.sleep(5) calls and a tryCatch so I'm not blocked too soon.
I use a VPN, but I have to manually disconnect and reconnect it every now and then to change my IP, which isn't workable for a scraper that's supposed to run all night long. I think rotating proxies should do the job.
Here's my current code (part of it, at least):
library(rvest)
library(dplyr)
scraped_data = data.frame()
for (i in urlsuffixes$suffix)
{
tryCatch({
message("Let's scrape that, Buddy !")
Sys.sleep(5)
doctolib_url = paste0("https://www.website.com/test/", i)
page = read_html(doctolib_url)
links = page %>%
html_nodes(".seo-directory-doctor-link") %>%
html_attr("href")
Sys.sleep(5)
name = page %>%
html_nodes(".seo-directory-doctor-link") %>%
html_text()
Sys.sleep(5)
job_title = page %>%
html_nodes(".seo-directory-doctor-speciality") %>%
html_text()
Sys.sleep(5)
address = page %>%
html_nodes(".seo-directory-doctor-address") %>%
html_text()
Sys.sleep(5)
scraped_data = rbind(scraped_data, data.frame(links,
                                              name,
                                              address,
                                              job_title,
                                              stringsAsFactors = FALSE))
}, error=function(e){cat("Houston, we have a problem !","\n",conditionMessage(e),"\n")})
print(paste("Page : ", i))
}
Interesting question. I think the first thing to note is that, as mentioned in this GitHub issue, rvest and xml2 use httr for the connections. As such, I'm going to introduce httr into this answer.
The following code chunk shows how to use httr to query a URL using a proxy and extract the HTML content:
page <- httr::content(
  httr::GET(
    url,
    httr::use_proxy(ip, port, username, password)
  )
)
If you are using IP authentication or don't need a username and password, you can simply exclude those values from the call. In short, you can replace the page = read_html(doctolib_url) line with the httr chunk above.
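For example, with an IP-authenticated proxy the same call could simply drop the credentials (the proxy values here are placeholders, not a recommendation):

# Placeholder proxy values -- no username/password needed with IP authentication
page <- httr::content(
  httr::GET(
    url,
    httr::use_proxy("64.235.204.107", 8080)
  )
)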
One big problem with using proxies is getting reliable ones. For this, I'm just going to assume that you have a reliable source. Since you haven't indicated otherwise, I'm going to assume that your proxies are stored in the following reasonable format, in an object named proxies:
ip | port
---|---
64.235.204.107 | 8080
167.71.190.253 | 80
185.156.172.122 | 3128
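If it helps, that structure can be built as a plain data frame; the values below are just the illustrative ones from the table, so substitute your own list:

# Illustrative proxies data frame matching the table above -- use your own proxies
proxies <- data.frame(
  ip   = c("64.235.204.107", "167.71.190.253", "185.156.172.122"),
  port = c(8080L, 80L, 3128L),
  stringsAsFactors = FALSE
)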
With that format in mind, you could tweak the script chunk above to rotate proxies for every web request as follows:
library(dplyr)
library(httr)
library(rvest)
scraped_data = data.frame()
for (i in seq_along(urlsuffixes$suffix))
{
tryCatch({
message("Let's scrape that, Buddy !")
Sys.sleep(5)
doctolib_url = paste0("https://www.website.com/test/",
                      urlsuffixes$suffix[[i]])
# There are more urls than proxies, so cycle through the proxy list:
# map i into 1..nrow(proxies) (not the most elegant, but it works)
proxy_id <- ifelse(i %% nrow(proxies) == 0, nrow(proxies), i %% nrow(proxies))
page <- httr::content(
  httr::GET(
    doctolib_url,
    httr::use_proxy(proxies$ip[[proxy_id]], proxies$port[[proxy_id]])
  )
)
links = page %>%
html_nodes(".seo-directory-doctor-link") %>%
html_attr("href")
Sys.sleep(5)
name = page %>%
html_nodes(".seo-directory-doctor-link") %>%
html_text()
Sys.sleep(5)
job_title = page %>%
html_nodes(".seo-directory-doctor-speciality") %>%
html_text()
Sys.sleep(5)
address = page %>%
html_nodes(".seo-directory-doctor-address") %>%
html_text()
Sys.sleep(5)
scraped_data = rbind(scraped_data, data.frame(links,
                                              name,
                                              address,
                                              job_title,
                                              stringsAsFactors = FALSE))
}, error=function(e){cat("Houston, we have a problem !","\n",conditionMessage(e),"\n")})
print(paste("Page : ", i))
}
You might want to go a few steps further and add elements to the httr request, such as a user agent (a minimal sketch of that follows this paragraph). However, one of the big problems with a package like httr is that it can't render dynamic HTML content, such as JavaScript-rendered pages, and any website that really cares about blocking scrapers is going to detect this. To conquer this problem there are tools such as headless Chrome that are meant to address specifically this kind of thing. Here's a package you might want to look into for headless Chrome in R (note: it's still in development).
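As a minimal sketch of that first point, a user agent can be passed as another config object to the same GET call; the user-agent string below is just an illustrative example, not something the site requires:

library(httr)

# Same proxied request as in the loop above, plus a browser-like user agent
# (the user-agent string is an arbitrary example value)
page <- httr::content(
  httr::GET(
    doctolib_url,
    httr::use_proxy(proxies$ip[[proxy_id]], proxies$port[[proxy_id]]),
    httr::user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
  )
)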
I think this code will work, but since there's no reproducible data to test with, I can't be sure.