How to fix a 403 error while scraping with Scrapy?


I keep getting a 403 error when using Scrapy, even though I have proper headers set. The website I am trying to scrape is https://steamdb.info/graph/.

My code:

    def start_requests(self):
        # Scrapy calls start_requests() (plural) to obtain the initial requests.
        headers = {
            "user-agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Mobile Safari/537.36",
            "accept": "application/json",
            "accept-encoding": "gzip, deflate, br",
            "accept-language": "en-US,en;q=0.9,en-GB;q=0.8,ar;q=0.7",
            "cache-control": "no-cache",
            "pragma": "no-cache",
            "referer": "https://steamdb.info/graph/",
            "sec-fetch-dest": "empty",
            "sec-fetch-mode": "cors",
            "sec-fetch-site": "same-origin",
            "x-requested-with": "XMLHttpRequest",
        }

        yield scrapy.Request(url='https://steamdb.info/graph', method='GET', headers=headers, callback=self.parse)

    def parse(self, response):
        # stuff to do
        pass

Error:

2022-07-08 20:20:41 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://steamdb.info/graph> (referer: https://steamdb.info/graph/)
2022-07-08 20:20:41 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://steamdb.info/graph>: HTTP status code is not handled or not allowed

Solution

  • CloudScraper worked for me:

    pip install cloudscraper
    

    Then add the middleware to your settings.py:

    DOWNLOADER_MIDDLEWARES = {
        "YOUR_PATH.AntiBanMiddleware": 543,
    }
    

    Here is the AntiBanMiddleware:

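    import cloudscraper
    from scrapy.http import HtmlResponse


    class AntiBanMiddleware:
        # One cloudscraper session, shared by every retried request.
        cloudflare_scraper = cloudscraper.create_scraper()

        def process_response(self, request, response, spider):
            request_url = request.url
            response_status = response.status
            # Pass through anything that is not a 403 or 503.
            if response_status not in (403, 503):
                return response

            spider.logger.info("Cloudflare detected. Using cloudscraper on URL: %s", request_url)
            # Blocking call: fetch the same URL again with cloudscraper (plain GET).
            cflare_response = self.cloudflare_scraper.get(request_url)
            # Wrap the result so the spider callback receives a normal Scrapy response.
            cflare_res_transformed = HtmlResponse(url=request_url, body=cflare_response.text, encoding='utf-8')
            return cflare_res_transformed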
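
The middleware above treats every 403 or 503 as a Cloudflare block, so it would also re-fetch pages that are forbidden or unavailable for other reasons. One possible refinement, sketched below, is to also check that the Server response header mentions Cloudflare. The helper name looks_like_cloudflare_block and the BLOCK_STATUSES constant are hypothetical and not part of Scrapy or cloudscraper. The header check is only a heuristic and has not been verified against steamdb.info.

```python
from typing import Mapping

# Status codes the middleware above treats as a possible block.
BLOCK_STATUSES = (403, 503)


def looks_like_cloudflare_block(status: int, headers: Mapping[str, str]) -> bool:
    """Heuristic: True when the response looks like a Cloudflare block page.

    Hypothetical helper. It requires a 403/503 status and a Server header
    that mentions Cloudflare. Header names are matched case-insensitively.
    """
    if status not in BLOCK_STATUSES:
        return False
    server = ""
    for name, value in headers.items():
        if name.lower() == "server":
            server = value
            break
    return "cloudflare" in server.lower()


if __name__ == "__main__":
    print(looks_like_cloudflare_block(403, {"Server": "cloudflare"}))
    print(looks_like_cloudflare_block(403, {"Server": "nginx"}))
```

Inside process_response, the status check could then be replaced by a call to this helper. Scrapy stores header names and values as bytes, so they would need to be decoded into a plain dict of strings before being passed in.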