
StormCrawler workaround for pages with HTTP 405 code


I wanted to crawl a webpage like this one.

It seems that I get a 405 error:

2018-04-09 11:18:40.930 c.d.s.b.FetcherBolt FetcherThread #2 [INFO] [Fetcher #3] Fetched https://www.notebooksbilliger.de/lenovo+320+15abr+80xs009bge/incrpc/topprod with status 405 in msec 53

The page seems to have crawler protection. Is it still possible to crawl it with StormCrawler, perhaps in combination with Selenium?


Solution

  • That site does not disallow robots but returns a 405 if the user agent does not look like a browser. You can reproduce the issue with curl:

    curl -I "https://www.notebooksbilliger.de/lenovo+320+15abr+80xs009bge"
    
    HTTP/1.1 405 Method Not Allowed
    Accept-Ranges: bytes
    Content-Type: text/html
    Server: nginx
    Surrogate-Control: no-store, bypass-cache
    X-Distil-CS: BYPASS
    Expires: Mon, 09 Apr 2018 10:48:02 GMT
    Cache-Control: max-age=0, no-cache, no-store
    Pragma: no-cache
    Date: Mon, 09 Apr 2018 10:48:02 GMT
    Connection: keep-alive
    
    curl -A "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36" -I "https://www.notebooksbilliger.de/lenovo+320+15abr+80xs009bge"
    
    HTTP/1.1 200 OK
    Content-Type: text/html
    Server: nginx
    Surrogate-Control: no-store, bypass-cache
    Expires: Mon, 09 Apr 2018 10:48:26 GMT
    Cache-Control: max-age=0, no-cache, no-store
    Pragma: no-cache
    Date: Mon, 09 Apr 2018 10:48:26 GMT
    Connection: keep-alive
    

    One workaround could be to use Selenium, as suggested, or simply to change the user agent so that it mimics what a browser would send. This is not ideal, as it is always preferable to be open about your crawler, but in this particular case the site would block crawlers in its robots.txt if that were its intention.
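
    If you go the Selenium route, StormCrawler ships a Selenium-based protocol implementation that can be switched on through the configuration. A minimal sketch, assuming the class and key names from the core module at the time of writing; verify them against your release:

    # crawler-conf.yaml (sketch): fetch pages through a remote Selenium
    # browser instead of the default HTTP client. The class and key names
    # below are assumptions based on the core module; check your version.
    config:
      http.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"
      https.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"
      # Address of the remote WebDriver, e.g. a Selenium standalone server
      selenium.addresses:
        - "http://localhost:4444/wd/hub"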

    You can change the user agent via configuration in StormCrawler.
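
    The user agent is assembled from the http.agent.* properties in crawler-conf.yaml, the same keys used in the default archetype configuration. A minimal sketch that makes the assembled string look roughly browser-like; the values are illustrative only:

    # crawler-conf.yaml (sketch): user agent settings. The http.agent.*
    # keys come from the default configuration; the values are examples.
    config:
      http.agent.name: "Mozilla"
      http.agent.version: "5.0"
      http.agent.description: "X11; Linux x86_64; AppleWebKit/537.36; Chrome/65.0.3325.181"
      http.agent.url: ""
      http.agent.email: ""

    Note that leaving http.agent.url and http.agent.email empty withholds the contact details a polite crawler would normally advertise, which is exactly the trade-off described above.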