I want to crawl a webpage like this one, but I get a 405 error:
2018-04-09 11:18:40.930 c.d.s.b.FetcherBolt FetcherThread #2 [INFO] [Fetcher #3] Fetched https://www.notebooksbilliger.de/lenovo+320+15abr+80xs009bge/incrpc/topprod with status 405 in msec 53
The page seems to have crawler protection. Is it still possible to crawl it with StormCrawler, maybe together with Selenium?
That site does not disallow robots in its robots.txt, but it returns a 405 if the user agent does not look like a browser. You can reproduce the issue with curl:
curl -I "https://www.notebooksbilliger.de/lenovo+320+15abr+80xs009bge"
HTTP/1.1 405 Method Not Allowed
Accept-Ranges: bytes
Content-Type: text/html
Server: nginx
Surrogate-Control: no-store, bypass-cache
X-Distil-CS: BYPASS
Expires: Mon, 09 Apr 2018 10:48:02 GMT
Cache-Control: max-age=0, no-cache, no-store
Pragma: no-cache
Date: Mon, 09 Apr 2018 10:48:02 GMT
Connection: keep-alive
curl -A "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36" -I "https://www.notebooksbilliger.de/lenovo+320+15abr+80xs009bge"
HTTP/1.1 200 OK
Content-Type: text/html
Server: nginx
Surrogate-Control: no-store, bypass-cache
Expires: Mon, 09 Apr 2018 10:48:26 GMT
Cache-Control: max-age=0, no-cache, no-store
Pragma: no-cache
Date: Mon, 09 Apr 2018 10:48:26 GMT
Connection: keep-alive
One workaround would be to use Selenium, as you suggested, or simply to change the user agent so that it mimics what a browser sends. That is not ideal, as it is always preferable to be open about your crawler, but in this particular case the site would have blocked crawlers in its robots.txt if that had been its intention.
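If you go down the Selenium route, StormCrawler ships a selenium module whose RemoteDriverProtocol fetches pages through a remote browser. A minimal configuration sketch, assuming that module and a Selenium server at localhost:4444 (the address is an assumption for this example; check the documentation for your StormCrawler version):

http.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"
https.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"
# addresses of the remote WebDriver instances; assumed local Selenium server
selenium.addresses:
  - "http://localhost:4444/wd/hub"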
You can change the user agent via configuration in StormCrawler.
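Concretely, the user agent is built from the http.agent.* keys in crawler-conf.yaml. A minimal sketch mimicking the browser string used above, assuming that leaving the other parts empty makes the name alone the user-agent string (worth verifying against your StormCrawler version):

# crawler-conf.yaml
http.agent.name: "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"
http.agent.version: ""
http.agent.description: ""
http.agent.url: ""
http.agent.email: ""

The trade-off stands: a browser-like user agent gets you the 200 shown above, but it hides the fact that the requests come from a crawler.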