python, web-scraping, scrapy

Python Scrapy Shell Error While Scraping Walmart


I am scraping walmart.com using Scrapy. When I fetch https://www.walmart.com/ there is no error, but when I try to fetch "https://www.walmart.com/search?q=tablets&typeahead=tabltes" the error below appears. I have already disabled robots.txt obedience and employed scrapy-fake-useragent.

2024-02-14 09:42:25 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET https://www.walmart.com/search?q=tablets&typeahead=tabltes>
2024-02-14 09:42:25 [py.warnings] WARNING: C:\Users\SADAM1\PycharmProjects\untitled4\v

import scrapy

class Wal1Spider(scrapy.Spider):
    name = "wal1"
    allowed_domains = ["walmart.com"]
    start_urls = ["https://walmart.com"]
    
    
    custom_settings = {
        "DOWNLOAD_DELAY": 6.3,
        "RANDOMIZE_DOWNLOAD_DELAY": True,
        "COOKIES_ENABLED": False,
        "AUTOTHROTTLE_ENABLED": True,
        "AUTOTHROTTLE_START_DELAY": 2,
        "AUTOTHROTTLE_MAX_DELAY": 11.7,
        "AUTOTHROTTLE_TARGET_CONCURRENCY": 1,
        "CONCURRENT_REQUESTS": 4,
        "ROBOTSTXT_OBEY": False,
    }
    def parse(self, response):
        pass

env\lib\site-packages\scrapy_fake_useragent\middleware.py:95: ScrapyDeprecationWarning: Attribute RetryMiddleware.EXCEPTIONS_TO_RETRY is deprecated. Use the RETRY_EXCEPTIONS setting instead.
  if isinstance(exception, self.EXCEPTIONS_TO_RETRY)

I have tried disabling ROBOTSTXT_OBEY and employing scrapy-fake-useragent.


Solution

  • If you're using the Scrapy shell, the settings you define in your spider aren't applied. You can pass that specific option to scrapy shell through the --set flag, which sets/overrides settings:

    $ scrapy shell --set="ROBOTSTXT_OBEY=False"
    

    once in the shell:

    fetch("https://www.walmart.com/search?q=tablets&typeahead=tablte")
    # 2024-02-14 19:24:29 [scrapy.core.engine] INFO: Spider opened
    # 2024-02-14 19:24:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.walmart.com/search?q=tablets&typeahead=tablte> (referer: None)
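    If you'd rather not pass the flag every time, the same setting can live in the project's settings.py, which both `scrapy crawl` and `scrapy shell` read (a sketch assuming a standard Scrapy project layout; the throttling values mirrored from the spider are optional and only shown for symmetry):

    ```python
    # settings.py -- project-wide Scrapy settings. Unlike a spider's
    # custom_settings (which only apply when that spider runs), these are
    # picked up by `scrapy shell` as well.
    ROBOTSTXT_OBEY = False

    # Optional: mirror the spider's throttling so interactive fetches
    # from the shell behave the same way as the crawl.
    DOWNLOAD_DELAY = 6.3
    AUTOTHROTTLE_ENABLED = True
    ```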