
scrapy failed to fetch, but curl or browser can retrieve the page


I have a simple Scrapy spider.

import scrapy
from scrapy.crawler import CrawlerProcess

class ScraperSpider(scrapy.Spider):
    name = "scraper"

    def start_requests(self):
        urls = [
            'https://api.ipify.org?format=json',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        self.logger.info('================Request: %s, IP address: %s' % (response.request, response.text))

if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(ScraperSpider)
    process.start()

However, it gives an error:

2023-12-18 23:56:34 [scrapy.core.engine] DEBUG: Crawled (400) <GET https://api.ipify.org?format=json> (referer: None)
2023-12-18 23:56:34 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 https://api.ipify.org?format=json>: HTTP status code is not handled or not allowed
2023-12-18 23:56:34 [scrapy.core.engine] INFO: Closing spider (finished)

However, the same URL can be fetched successfully with curl or a browser.
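
One likely culprit is the empty path before the query string: curl and browsers normalize https://api.ipify.org?format=json to a request for /?format=json, while Scrapy appears to send the request without the leading /, which this server answers with 400. To confirm, you can let the 400 response reach the spider and log its body; HttpErrorMiddleware filters non-2xx responses out by default, but a spider can opt in with handle_httpstatus_list. A minimal debugging sketch (the spider name debug_scraper is made up here):

import scrapy
from scrapy.crawler import CrawlerProcess

class DebugScraperSpider(scrapy.Spider):
    # Hypothetical spider, only used to inspect the failing response
    name = "debug_scraper"
    # Let 400 responses through HttpErrorMiddleware so parse() can see them
    handle_httpstatus_list = [400]

    def start_requests(self):
        yield scrapy.Request(url='https://api.ipify.org?format=json', callback=self.parse)

    def parse(self, response):
        # Log what the server actually answered to the path-less request
        self.logger.info('Status: %s, body: %r', response.status, response.text[:200])

if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(DebugScraperSpider)
    process.start()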


Solution

  • Add a / before the ? in the URL:

    import scrapy
    from scrapy.crawler import CrawlerProcess


    class ScraperSpider(scrapy.Spider):
        name = "scraper"

        def start_requests(self):
            urls = [
                # The trailing slash before '?' is what avoids the 400 response
                'https://api.ipify.org/?format=json',
            ]
            for url in urls:
                yield scrapy.Request(url=url)

        def parse(self, response):
            self.logger.info('================Request: %s, IP address: %s' % (response.request, response.json().get('ip')))


    if __name__ == "__main__":
        process = CrawlerProcess()
        process.crawl(ScraperSpider)
        process.start()
    

    Output:

    [scrapy.core.engine] DEBUG: Crawled (200) <GET https://api.ipify.org/?format=json> (referer: None)
    [scraper] INFO: ================Request: <GET https://api.ipify.org/?format=json>, IP address: X.X.X.X
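
    If many start URLs might be missing the path component, the trailing slash can be added programmatically instead of editing each one by hand. A small sketch using urllib.parse from the standard library (the helper name ensure_path is made up here):

    from urllib.parse import urlsplit, urlunsplit

    def ensure_path(url):
        # Hypothetical helper: give the URL a '/' path if it has none
        parts = urlsplit(url)
        return urlunsplit((parts.scheme, parts.netloc, parts.path or '/', parts.query, parts.fragment))

    # 'https://api.ipify.org?format=json' -> 'https://api.ipify.org/?format=json'
    urls = [ensure_path(u) for u in ['https://api.ipify.org?format=json']]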