When I am in scrapy shell and I run:
fetch('https://www.google.nl')
Then I get a normal response:
2020-11-19 12:42:00 [scrapy.core.engine] INFO: Spider opened
2020-11-19 12:42:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.google.nl> (referer: None)
But when I do this for Zalando pages, for example:
fetch('https://www.zalando.de/nike-sportswear-pant-jogginghose-ni121a09o-c11.html')
Then I only see:
2020-11-19 12:46:06 [scrapy.core.engine] INFO: Spider opened
And after a while I get a timeout. Why is this not working for Zalando pages? Or: what should I change to make this work?
Include a User-Agent in your Request's headers; this worked fine for me:
from scrapy import Request

url = 'https://www.zalando.de/nike-sportswear-pant-jogginghose-ni121a09o-c11.html'
req = Request(url, headers={
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0'
})
fetch(req)
It could be an anti-bot measure: sites like Zalando often stall or drop requests that arrive with Scrapy's default User-Agent.
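As an alternative sketch (assuming a standard Scrapy project layout), you can set the User-Agent once in your project's settings.py so every request sends it, instead of passing headers per Request; the string below is just an example browser User-Agent:

```python
# settings.py -- project-wide User-Agent, applied to all requests
# (example value; any common browser User-Agent string should work)
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0'
```

For the shell specifically, you can get the same effect without editing any file by overriding the setting on startup, e.g. `scrapy shell -s USER_AGENT='Mozilla/5.0 ...'`, and then calling `fetch(url)` as usual.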