
How to allow scrapy to follow redirects?


I am trying to scrape data from historical versions of web pages as backed up by the Wayback Machine.

I have thousands of pages that need scraping, and I don't want to go to the trouble of finding out the exact dates and times of the available backups for each of them. I just want to get weekly historical data, or the nearest available capture.

What I know is that if I put a date in a link here:

https://web.archive.org/web/<some_date>/<some_url>

then the Wayback Machine will automatically redirect to the closest available capture. This will work fine in my scenario.
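As a minimal sketch of how such weekly links could be generated: the helper name `weekly_wayback_urls` and the `http://example.com/` target are hypothetical, and the `YYYYMMDD` timestamp format is assumed from the `20050313` form that appears in the log further down.

```python
from datetime import date, timedelta

WAYBACK_PREFIX = "https://web.archive.org/web"


def weekly_wayback_urls(page_url, start, end):
    """Yield one Wayback Machine URL per week from start to end (inclusive)."""
    current = start
    while current <= end:
        # Timestamp is formatted as YYYYMMDD, e.g. 20050313.
        yield f"{WAYBACK_PREFIX}/{current:%Y%m%d}/{page_url}"
        current += timedelta(weeks=1)


# Hypothetical example: four weekly snapshots of one page.
urls = list(
    weekly_wayback_urls("http://example.com/", date(2005, 3, 13), date(2005, 4, 3))
)
```

Each generated URL can then be used as a start URL for the spider; the Wayback Machine decides which capture is closest to the requested date.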

I have a Scrapy spider that extracts the data. I have already used it successfully on the current versions of the web pages, so I know that it works and produces the correct output. But when I run it on the backed-up versions of the pages, I get the following output, which reports that the page is redirecting, and no data is returned:

2023-05-04 20:18:33 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2023-05-04 20:18:33 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2023-05-04 20:18:33 [scrapy.core.engine] INFO: Spider opened
2023-05-04 20:18:33 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-05-04 20:18:33 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-05-04 20:18:36 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://web.archive.org/web/20200204105913/<some_url>> from <GET https://web.archive.org/web/20050313/<some_url>>

I've looked at other questions of a similar nature and I understand that I need to do something with the middleware, but those questions were about preventing redirects, while I want the exact opposite.

How do I allow scrapy to follow redirects?


Solution

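  • According to the documentation link @beer provided, redirects are followed by the RedirectMiddleware. Scrapy enables it by default, and your log shows it is already active (the "Redirecting (302)" line comes from it), so there is nothing extra to switch on.

    However, the documentation also says: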

    For example, if you want the redirect middleware to ignore 301 and 302 responses (and pass them through to your spider) you can do this:

    class MySpider(CrawlSpider):
        handle_httpstatus_list = [301, 302]
    

    This attribute makes the RedirectMiddleware skip the listed HTTP statuses, so 301 and 302 responses are passed to your spider instead of being followed. If your spider sets handle_httpstatus_list, try running it without that attribute.
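To make this concrete, here is a minimal sketch of the relevant entries in a project's settings.py. `REDIRECT_ENABLED` and `REDIRECT_MAX_TIMES` are Scrapy settings; the values shown are, as far as I know, the defaults, so writing them out is only a safeguard against an override elsewhere in the project.

```python
# settings.py (fragment): keep Scrapy's RedirectMiddleware active.
# Both values are believed to match Scrapy's defaults; listing them only
# guards against an override elsewhere in the project.
REDIRECT_ENABLED = True    # follow 3xx responses instead of passing them to the spider
REDIRECT_MAX_TIMES = 20    # maximum number of redirects followed for a single request

# In the spider itself, do not define handle_httpstatus_list = [301, 302]:
# with it set, the 302 from web.archive.org is handed to the callback
# instead of being followed to the actual capture.
```

Once the redirect has been followed, `response.url` is the address of the capture that was actually served, and the originally requested URL is kept in `response.meta["redirect_urls"]`.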