
How to re-scrape a page if there is an error in parse method?


The first action in my parse method is to extract a dictionary from a JSON string contained in the HTML. I've noticed that I sometimes get an error because the web page doesn't render correctly and thus doesn't contain the JSON string. If I rerun the spider, the same page displays fine and it carries on until another random JSON error.

I'd like to check that I've got the error handling correct:

import json
from json import JSONDecodeError

def parse(self, response):
    json_str = response.xpath("<xpath_to_json>").get()
    try:
        items = json.loads(json_str)["items"]
    except JSONDecodeError:
        return response.follow(url=response.url, callback=self.parse)
    for i in items:
        # do stuff

I'm pretty sure this will work OK, but I wanted to check a couple of things:

  1. If this hits a 'genuinely bad' page where there is no JSON, will the spider get stuck in a loop, or does Scrapy give up after trying a given URL a certain number of times?
  2. I've used a return instead of a yield because I don't want to continue running the method. Is this OK?

Any other comments are welcome too!!


Solution

  • I think return on a decoding error should be OK in your case, since the scraper isn't iterating through scraped results at that point. Normally response.follow and Request filter out duplicate requests, so you need to pass dont_filter=True when calling them to allow repeated requests for the same URL. To cap the number of retries, it's not the cleanest approach, but you can keep a dictionary on the spider that tracks the retry count per URL (self.retry_count in the code below), increment it each time that URL is parsed, and raise once the limit is hit.

    import json
    from json import JSONDecodeError
    import scrapy


    class TestSpider(scrapy.Spider):
        name = "test"

        def start_requests(self):
            urls = [
                "https://quotes.toscrape.com/page/1/",
                "https://quotes.toscrape.com/page/2/"
            ]
            # initialise the per-URL retry counter and limit once, before scheduling requests
            self.retry_count = {url: 0 for url in urls}
            self.retry_limit = 3
            for url in urls:
                yield scrapy.Request(url=url, callback=self.parse, dont_filter=True)

        def parse(self, response):
            self.retry_count[response.url] += 1
            json_str = "{\"items\": 1"  # malformed on purpose to trigger a JSON decode error
            print(f'===== RUN {response.url}; Attempt: {self.retry_count[response.url]} =====')
            try:
                items = json.loads(json_str)["items"]
            except JSONDecodeError as ex:
                print("==== ERROR ====")
                if self.retry_count[response.url] >= self.retry_limit:
                    raise ex
                else:
                    # re-request the same URL; dont_filter=True bypasses the duplicate filter
                    return response.follow(url=response.url, callback=self.parse, dont_filter=True)

            self.retry_count[response.url] = 0  # reset the counter as this parse succeeded
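
  • If you would rather not keep retry state on the spider object, a variant is to carry the attempt count in the request's meta dict, which is roughly how Scrapy's built-in RetryMiddleware tracks its retry_times value for HTTP-level retries. The sketch below is illustrative only: the <xpath_to_json> placeholder comes from your question and the limit of 3 is arbitrary.

    import json
    from json import JSONDecodeError
    import scrapy


    class MetaRetrySpider(scrapy.Spider):
        name = "test_meta"
        start_urls = [
            "https://quotes.toscrape.com/page/1/",
            "https://quotes.toscrape.com/page/2/",
        ]
        retry_limit = 3

        def parse(self, response):
            attempt = response.meta.get("parse_retries", 0)
            json_str = response.xpath("<xpath_to_json>").get()  # replace with your real selector
            try:
                items = json.loads(json_str)["items"]
            except (JSONDecodeError, TypeError):
                # TypeError covers the case where the XPath matched nothing and json_str is None
                if attempt + 1 >= self.retry_limit:
                    self.logger.error("Giving up on %s after %d attempts", response.url, attempt + 1)
                    return
                # re-request the same page, carrying the attempt count in meta
                yield response.follow(
                    response.url,
                    callback=self.parse,
                    dont_filter=True,
                    meta={"parse_retries": attempt + 1},
                )
                return
            for i in items:
                yield {"item": i}  # do stuff with each item

Because each retried request carries its own count, parallel pages don't interfere and there is no shared dictionary to reset on success. Note that this version yields items, so the retry request also has to be yielded rather than returned: a return with a value inside a generator callback is ignored by Scrapy.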