
StormCrawler workaround for pages with HTTP 405 code


I wanted to crawl a webpage like this one.

It seems that I get a 405 error:

2018-04-09 11:18:40.930 c.d.s.b.FetcherBolt FetcherThread #2 [INFO] [Fetcher #3] Fetched https://www.notebooksbilliger.de/lenovo+320+15abr+80xs009bge/incrpc/topprod with status 405 in msec 53

The page seems to have crawler protection. Is it still possible to crawl it with StormCrawler, perhaps in combination with Selenium?


Solution

  • That site does not disallow robots but returns a 405 if the user agent does not look like a browser. You can reproduce the issue with curl:

    curl -I "https://www.notebooksbilliger.de/lenovo+320+15abr+80xs009bge"
    
    HTTP/1.1 405 Method Not Allowed
    Accept-Ranges: bytes
    Content-Type: text/html
    Server: nginx
    Surrogate-Control: no-store, bypass-cache
    X-Distil-CS: BYPASS
    Expires: Mon, 09 Apr 2018 10:48:02 GMT
    Cache-Control: max-age=0, no-cache, no-store
    Pragma: no-cache
    Date: Mon, 09 Apr 2018 10:48:02 GMT
    Connection: keep-alive
    
    curl -A "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36" -I "https://www.notebooksbilliger.de/lenovo+320+15abr+80xs009bge"
    
    HTTP/1.1 200 OK
    Content-Type: text/html
    Server: nginx
    Surrogate-Control: no-store, bypass-cache
    Expires: Mon, 09 Apr 2018 10:48:26 GMT
    Cache-Control: max-age=0, no-cache, no-store
    Pragma: no-cache
    Date: Mon, 09 Apr 2018 10:48:26 GMT
    Connection: keep-alive
    

    One workaround could be to use Selenium, as suggested, or simply to change the user agent so that it mimics what a browser would send. This is not ideal, as it is always preferable to be open about your crawler, but in this particular case the site would block crawlers in its robots.txt if that were its intention.
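
    If you go the Selenium route, StormCrawler ships a Selenium-based protocol implementation that can be switched on through the configuration. A minimal sketch, assuming the class and key names from the core module at the time of writing; verify them against your release:

    # crawler-conf.yaml (sketch): fetch pages through a remote Selenium
    # browser instead of the default HTTP client. The class and key names
    # below are assumptions based on the core module; check your version.
    config:
      http.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"
      https.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"
      # Address of the remote WebDriver, e.g. a Selenium standalone server
      selenium.addresses:
        - "http://localhost:4444/wd/hub"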

    You can change the user agent via configuration in StormCrawler.
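
    The user agent is assembled from the http.agent.* properties in crawler-conf.yaml, the same keys used in the default archetype configuration. A minimal sketch that makes the assembled string look roughly browser-like; the values are illustrative only:

    # crawler-conf.yaml (sketch): user agent settings. The http.agent.*
    # keys come from the default configuration; the values are examples.
    config:
      http.agent.name: "Mozilla"
      http.agent.version: "5.0"
      http.agent.description: "X11; Linux x86_64; AppleWebKit/537.36; Chrome/65.0.3325.181"
      http.agent.url: ""
      http.agent.email: ""

    Note that leaving http.agent.url and http.agent.email empty withholds the contact details a polite crawler would normally advertise, which is exactly the trade-off described above.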