I'm doing some scraping, but as I'm parsing approximately 4000 URLs, the website eventually detects my IP and blocks me every 20 iterations.
I've added a bunch of Sys.sleep(5) calls and a tryCatch so I'm not blocked too soon.
I use a VPN, but I have to manually disconnect and reconnect it every now and then to change my IP, which isn't workable for a scraper that's supposed to run all night long. I think rotating proxies should do the job.
Here's my current code (part of it, at least):
library(rvest)
library(dplyr)
scraped_data = data.frame()
for (i in urlsuffixes$suffix)
{
tryCatch({
message("Let's scrape that, Buddy !")
Sys.sleep(5)
doctolib_url = paste0("https://www.website.com/test/", i)
page = read_html(doctolib_url)
links = page %>%
html_nodes(".seo-directory-doctor-link") %>%
html_attr("href")
Sys.sleep(5)
name = page %>%
html_nodes(".seo-directory-doctor-link") %>%
html_text()
Sys.sleep(5)
job_title = page %>%
html_nodes(".seo-directory-doctor-speciality") %>%
html_text()
Sys.sleep(5)
address = page %>%
html_nodes(".seo-directory-doctor-address") %>%
html_text()
Sys.sleep(5)
scraped_data = rbind(scraped_data, data.frame(links,
                                              name,
                                              address,
                                              job_title,
                                              stringsAsFactors = FALSE))
}, error=function(e){cat("Houston, we have a problem !","\n",conditionMessage(e),"\n")})
print(paste("Page : ", i))
}
Interesting question. I think the first thing to note is that, as mentioned in this GitHub issue, rvest and xml2 use httr for the connections. As such, I'm going to introduce httr into this answer.
The following code chunk shows how to use httr to query a URL using a proxy and extract the HTML content:
page <- httr::content(
  httr::GET(
    url,
    httr::use_proxy(ip, port, username, password)
  )
)
If you are using IP authentication or don't need a username and password, you can simply exclude those values from the call. In short, you can replace the page = read_html(doctolib_url) line with the httr chunk above.
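For example, with an IP-authenticated proxy the same call could simply drop the credentials (the proxy values here are placeholders, not a recommendation):

# Placeholder proxy values -- no username/password needed with IP authentication
page <- httr::content(
  httr::GET(
    url,
    httr::use_proxy("64.235.204.107", 8080)
  )
)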
One big problem with using proxies is getting reliable ones. For this, I'm just going to assume that you have a reliable source. Since you haven't indicated otherwise, I'm going to assume that your proxies are stored in the following reasonable format, in an object named proxies:
ip | port
---|---
64.235.204.107 | 8080
167.71.190.253 | 80
185.156.172.122 | 3128
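If it helps, that structure can be built as a plain data frame; the values below are just the illustrative ones from the table, so substitute your own list:

# Illustrative proxies data frame matching the table above -- use your own proxies
proxies <- data.frame(
  ip   = c("64.235.204.107", "167.71.190.253", "185.156.172.122"),
  port = c(8080L, 80L, 3128L),
  stringsAsFactors = FALSE
)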
With that format in mind, you could tweak the script chunk above to rotate proxies for every web request as follows:
library(dplyr)
library(httr)
library(rvest)
scraped_data = data.frame()
for (i in seq_along(urlsuffixes$suffix))
{
tryCatch({
message("Let's scrape that, Buddy !")
Sys.sleep(5)
doctolib_url = paste0("https://www.website.com/test/",
                      urlsuffixes$suffix[[i]])
# There are more urls than proxies, so cycle through the proxy list:
# map i into 1..nrow(proxies) (not the most elegant, but it works)
proxy_id <- ifelse(i %% nrow(proxies) == 0, nrow(proxies), i %% nrow(proxies))
page <- httr::content(
  httr::GET(
    doctolib_url,
    httr::use_proxy(proxies$ip[[proxy_id]], proxies$port[[proxy_id]])
  )
)
links = page %>%
html_nodes(".seo-directory-doctor-link") %>%
html_attr("href")
Sys.sleep(5)
name = page %>%
html_nodes(".seo-directory-doctor-link") %>%
html_text()
Sys.sleep(5)
job_title = page %>%
html_nodes(".seo-directory-doctor-speciality") %>%
html_text()
Sys.sleep(5)
address = page %>%
html_nodes(".seo-directory-doctor-address") %>%
html_text()
Sys.sleep(5)
scraped_data = rbind(scraped_data, data.frame(links,
                                              name,
                                              address,
                                              job_title,
                                              stringsAsFactors = FALSE))
}, error=function(e){cat("Houston, we have a problem !","\n",conditionMessage(e),"\n")})
print(paste("Page : ", i))
}
You might want to go a few steps further and add elements to the httr request, such as a user agent (a minimal sketch of that follows this paragraph). However, one of the big problems with a package like httr is that it can't render dynamic HTML content, such as JavaScript-rendered pages, and any website that really cares about blocking scrapers is going to detect this. To conquer this problem there are tools such as headless Chrome that are meant to address specifically this kind of thing. Here's a package you might want to look into for headless Chrome in R (note: it's still in development).
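As a minimal sketch of that first point, a user agent can be passed as another config object to the same GET call; the user-agent string below is just an illustrative example, not something the site requires:

library(httr)

# Same proxied request as in the loop above, plus a browser-like user agent
# (the user-agent string is an arbitrary example value)
page <- httr::content(
  httr::GET(
    doctolib_url,
    httr::use_proxy(proxies$ip[[proxy_id]], proxies$port[[proxy_id]]),
    httr::user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
  )
)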
I think this code will work, but since there's no reproducible data to test with, I can't be sure.