I am scraping walmart.com using Scrapy. Fetching https://www.walmart.com/ works fine, but when I try to fetch "https://www.walmart.com/search?q=tablets&typeahead=tabltes" the error below appears. I have already set ROBOTSTXT_OBEY to False and am using scrapy-fake-useragent.
```
2024-02-14 09:42:25 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET https://www.walmart.com/search?q=tablets&typeahead=tabltes>
```

My spider:
```python
import scrapy


class Wal1Spider(scrapy.Spider):
    name = "wal1"
    allowed_domains = ["walmart.com"]
    start_urls = ["https://walmart.com"]
    custom_settings = {
        "DOWNLOAD_DELAY": 6.3,
        "RANDOMIZE_DOWNLOAD_DELAY": True,
        "COOKIES_ENABLED": False,
        "AUTOTHROTTLE_ENABLED": True,
        "AUTOTHROTTLE_START_DELAY": 2,
        "AUTOTHROTTLE_MAX_DELAY": 11.7,
        "AUTOTHROTTLE_TARGET_CONCURRENCY": 1,
        "CONCURRENT_REQUESTS": 4,
        "ROBOTSTXT_OBEY": False,
    }

    def parse(self, response):
        pass
```
I also get this deprecation warning from scrapy-fake-useragent:

```
2024-02-14 09:42:25 [py.warnings] WARNING: C:\Users\SADAM1\PycharmProjects\untitled4\venv\lib\site-packages\scrapy_fake_useragent\middleware.py:95: ScrapyDeprecationWarning: Attribute RetryMiddleware.EXCEPTIONS_TO_RETRY is deprecated. Use the RETRY_EXCEPTIONS setting instead.
  if isinstance(exception, self.EXCEPTIONS_TO_RETRY)
```
If you're using the Scrapy shell, the settings defined in your spider's custom_settings aren't applied. You can pass that specific option to scrapy shell through the --set flag, which sets/overrides settings:

```
$ scrapy shell --set="ROBOTSTXT_OBEY=False"
```
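Equivalently, the short -s form works, and you can hand the URL to the shell directly so it is fetched on startup (the URL here is just the one from the example below):

```
$ scrapy shell -s ROBOTSTXT_OBEY=False "https://www.walmart.com/search?q=tablets&typeahead=tablte"
```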
Once in the shell:
fetch("https://www.walmart.com/search?q=tablets&typeahead=tablte")
# 2024-02-14 19:24:29 [scrapy.core.engine] INFO: Spider opened
# 2024-02-14 19:24:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.walmart.com/search?q=tablets&typeahead=tablte> (referer: None)
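You can also confirm the override took effect from inside the shell, since it exposes a settings shortcut (a quick sanity check, not something the fix depends on):

```
>>> settings.getbool("ROBOTSTXT_OBEY")
False
```

Note that when you run the spider itself with scrapy crawl wal1, the custom_settings from your spider are applied, so ROBOTSTXT_OBEY=False should hold there without any flag. If you want it everywhere, including the shell, a minimal sketch assuming a standard Scrapy project layout is to set it once in settings.py:

```python
# settings.py -- project-wide setting, also picked up by scrapy shell
ROBOTSTXT_OBEY = False
```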