Tags: web-scraping, python-requests, scrapy, web-crawler

`scrapy` can't get response from a website but `requests` can


I am using Scrapy to crawl this page:

but for some reason Scrapy cannot get a response from this website. When I run the crawler I receive an HTTP 500 error.

Here is my basic spider:

import scrapy

class SavingsGov(scrapy.Spider):
    name        = 'savings'
    start_urls  = [
        'https://savings.gov.pk/download-draws/'
    ]

    def parse(self, response):
        for option in response.css('select option'):
            yield {
                'url': option.css('::attr(value)').get()
            }

And here are the errors I get when I run it (I have also increased the number of retries to 10 in `settings.py`):

2023-08-26 16:30:22 [scrapy.core.engine] INFO: Spider opened
2023-08-26 16:30:22 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-08-26 16:30:22 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-08-26 16:30:24 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 1 times): 500 Internal Server Error
2023-08-26 16:30:25 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 2 times): 500 Internal Server Error
2023-08-26 16:30:27 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 3 times): 500 Internal Server Error
2023-08-26 16:30:28 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 4 times): 500 Internal Server Error
2023-08-26 16:30:30 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 5 times): 500 Internal Server Error
2023-08-26 16:30:31 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 6 times): 500 Internal Server Error
2023-08-26 16:30:33 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 7 times): 500 Internal Server Error
2023-08-26 16:30:35 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 8 times): 500 Internal Server Error
2023-08-26 16:30:37 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 9 times): 500 Internal Server Error
2023-08-26 16:30:39 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 10 times): 500 Internal Server Error
2023-08-26 16:30:40 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://savings.gov.pk/robots.txt> (failed 11 times): 500 Internal Server Error
2023-08-26 16:30:40 [scrapy.core.engine] DEBUG: Crawled (500) <GET https://savings.gov.pk/robots.txt> (referer: None)
2023-08-26 16:30:40 [protego] DEBUG: Rule at line 1 without any user agent to enforce it on.
2023-08-26 16:30:41 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 1 times): 500 Internal Server Error
2023-08-26 16:30:43 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 2 times): 500 Internal Server Error
2023-08-26 16:30:44 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 3 times): 500 Internal Server Error
2023-08-26 16:30:46 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 4 times): 500 Internal Server Error
2023-08-26 16:30:47 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 5 times): 500 Internal Server Error
2023-08-26 16:30:49 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 6 times): 500 Internal Server Error
2023-08-26 16:30:50 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 7 times): 500 Internal Server Error
2023-08-26 16:30:52 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 8 times): 500 Internal Server Error
2023-08-26 16:30:53 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 9 times): 500 Internal Server Error
2023-08-26 16:30:55 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 10 times): 500 Internal Server Error
2023-08-26 16:30:56 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://savings.gov.pk/download-draws/> (failed 11 times): 500 Internal Server Error
2023-08-26 16:30:56 [scrapy.core.engine] DEBUG: Crawled (500) <GET https://savings.gov.pk/download-draws/> (referer: None)
2023-08-26 16:30:56 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <500 https://savings.gov.pk/download-draws/>: HTTP status code is not handled or not allowed
2023-08-26 16:30:56 [scrapy.core.engine] INFO: Closing spider (finished)

However, I can easily get a response using Python's `requests` module. Here is the code for that:

import requests

response = requests.get('https://savings.gov.pk/download-draws/')
print(response.text)

I don't know why this is happening; I assume the problem is with `scrapy.Request`.

Is there any way to perform the request with `requests` and pass the response to Scrapy? The preferable option, though, would be to somehow debug `scrapy.Request`.

I am new to Scrapy, so if there is a possibility that I'm misunderstanding the problem, please let me know.


Solution

  • The server is most likely rejecting requests that carry Scrapy's default user agent.

    Try setting a custom one in the spider's `custom_settings`. Also set `ROBOTSTXT_OBEY` to `False`, so the crawl does not stall on the failing `robots.txt` request.

    For example:

    import scrapy
    
    class SavingsGov(scrapy.Spider):
        name        = 'savings'
        start_urls  = [
            'https://savings.gov.pk/download-draws/'
        ]
        custom_settings = {
            "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36",
            "ROBOTSTXT_OBEY": False
        }
    
        def parse(self, response):
            for option in response.css('select option'):
                yield {
                    'url': option.css('::attr(value)').get()
                }
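
    If you want the same behavior for every spider in the project rather than just this one, the equivalent (this is the standard project-wide alternative to `custom_settings`, not something from the original question) can go in `settings.py`:

    ```python
    # settings.py -- project-wide equivalent of the spider's custom_settings
    USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36"
    ROBOTSTXT_OBEY = False
    ```

    Per-spider `custom_settings` take precedence over `settings.py`, so the spider above works either way.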
    

    Partial output:

    2023-08-26 21:11:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://savings.gov.pk/download-draws/>
    {'url': 'http://savings.gov.pk/rs-1500-draw-list/'}
    2023-08-26 21:11:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://savings.gov.pk/download-draws/>
    {'url': 'http://savings.gov.pk/rs-200-draws/'}
    2023-08-26 21:11:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://savings.gov.pk/download-draws/>
    {'url': 'http://savings.gov.pk/rs-1500-draws/'}
    2023-08-26 21:11:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://savings.gov.pk/download-draws/>
    {'url': 'http://savings.gov.pk/rs-25000-premium-bonds-draws/'}
    2023-08-26 21:11:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://savings.gov.pk/download-draws/>
    {'url': 'http://savings.gov.pk/rs-15000-draws/'}
    2023-08-26 21:11:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://savings.gov.pk/download-draws/>
    {'url': 'http://savings.gov.pk/rs-40000-premium-bonds-draws/'}
    2023-08-26 21:11:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://savings.gov.pk/download-draws/>
    {'url': 'http://savings.gov.pk/rs-40000-draws/'}
    2023-08-26 21:11:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://savings.gov.pk/download-draws/>
    {'url': 'http://savings.gov.pk/rs-25000-draws/'}
    2023-08-26 21:11:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://savings.gov.pk/download-draws/>
    {'url': 'http://savings.gov.pk/rs-7500-draws/'}