Tags: r, web-scraping, web-crawler, rcrawler

Rcrawler package: Rcrawler not crawling some websites


I'm using Rcrawler to crawl a vector of urls. For most of them it's working well, but every now and then one of them doesn't get crawled. At first I was only noticing this on https:// sites, which was addressed here. But I'm using version 0.1.7, which is supposed to have https:// capability.
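
For context, a rough sketch of how I'm looping over the URL vector (the URLs below are just placeholders; each site gets its own Rcrawler() call):

    library(Rcrawler)

    # Placeholder URL vector; each site is crawled with a separate Rcrawler() call
    urls <- c("https://example.com",
              "https://manager.submittable.com/beta/discover/?page=1&sort=")

    for (u in urls) {
      Rcrawler(Website = u)
    }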

I also found this other user who is having the same problem, but with http:// links as well. I checked on my instance and his websites didn't crawl properly for me either.

Here's what I get when I try to crawl one of these sites:

>library(Rcrawler)
>Rcrawler("https://manager.submittable.com/beta/discover/?page=1&sort=")
>In process : 1..
Progress: 100.00 %  :  1  parssed from  1  | Collected pages: 1  | 
Level: 1 
+ Check INDEX dataframe variable to see crawling details 
+ Collected web pages are stored in Project folder 
+ Project folder name : manager.submittable.com-191922 
+ Project folder path : /home/anna/Documents/Rstudio/Submittable/manager.submittable.com-191922 

Any thoughts? Still waiting for a reply from the creator.


Solution

  • You are trying to crawl password-protected, JavaScript-rendered pages, so you need a web driver to create a login session and render the JavaScript elements. For this reason, Rcrawler v0.1.9 implements a phantomjs web driver.

    For your case, start by installing the latest version of Rcrawler, then follow these steps:

    1 - Install the web driver (currently phantomjs)

    library(Rcrawler)    
    install_browser()
    

    2 - Run the headless browser (a real browser, but not visible):

    br <- run_browser()

    If you get an error, it means that your operating system or antivirus is blocking the web driver (phantomjs) process; try disabling your antivirus temporarily or adjusting your system configuration to allow the phantomjs and processx executables.

    3 - Authenticate the session

     br <- LoginSession(Browser = br, LoginURL = 'https://manager.submittable.com/login',
                        LoginCredentials = c('your login', 'your pass'),
                        cssLoginFields = c('#email', '#password'),
                        XpathLoginButton = "//*[@type='submit']")
    

    4 - Crawl the website pages

    Rcrawler(Website = "https://manager.submittable.com/beta/discover/", no_cores = 1,
             no_conn = 1, LoggedSession = br, RequestsDelay = 3)
    

    You can access the webdriver functions using:

    br$session$
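
    For example, navigation calls like these should work, assuming br$session exposes the standard webdriver-package session methods (go(), getTitle(), and so on):

    # Hypothetical usage of the headless session; method names assume the webdriver package
    br$session$go("https://manager.submittable.com/beta/discover/")
    br$session$getTitle()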
    

    RequestsDelay = 3: wait 3 seconds between requests, since some JavaScript elements take time to load completely.

    no_cores = no_conn = 1: retrieve pages one by one, as some websites deny multiple logged-in sessions.
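
    When the crawl is done, it's good practice to shut down the headless browser rather than leave the phantomjs process running; a minimal sketch, assuming the stop_browser() helper available in recent Rcrawler versions:

    # Close the headless browser session started with run_browser()
    stop_browser(br)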

    This is supposed to crawl password-protected web pages; however, bigger websites have advanced protection against web scraping, such as reCAPTCHA or other HTTP/JavaScript rules that detect successive/automated requests, so it's better to use their API if they provide one.

    We are still working on providing the ability to crawl multiple websites within one command. For now, you can only crawl each one separately, or use the ContentScraper function if you want to scrape URLs/pages from the same website, as in the sketch below.
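
    As an illustration only (the URLs and the CSS selector below are made up), a ContentScraper sketch for scraping one field from several pages of the same website:

    library(Rcrawler)

    # Illustrative page URLs and selector; replace with real pages from one website
    pages <- c("https://example.com/post-1", "https://example.com/post-2")
    titles <- ContentScraper(Url = pages,
                             CssPatterns = c("h1"),
                             PatternsName = c("title"))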

    Rcrawler creator