python, web-scraping, scrapy, scraper

Scrapy scrapes one page n times but the other only once when in a loop


I am scraping two pages for each id in a loop. The first request is scraped for every id, but the second is scraped for only one id.

import scrapy
from scrapy import Request


class MySpider(scrapy.Spider):
  name = "scraper"
  allowed_domains = ["example.com"]
  start_urls = ['http://example.com/viewData']

  def parse(self, response):
    ids = ['1', '2', '3']

    for id in ids:
      # This request is scraped for every id
      yield scrapy.FormRequest.from_response(response,
                                             ...
                                             callback=self.parse1)

      # This request is scraped only for the first id
      yield Request(url="http://example.com/viewSomeOtherData",
                    callback=self.intermediateMethod)

  def parse1(self, response):
    # Data scraped here using selectors
    pass

  def intermediateMethod(self, response):
    yield scrapy.FormRequest.from_response(response,
                                                ...
                                           callback=self.parse2)

  def parse2(self, response):
    # Some other data scraped here
    pass

I want to scrape two different pages for each id.


Solution

  • Changing the following line:

    yield Request(url="http://example.com/viewSomeOtherData",
                  callback=self.intermediateMethod)
    

    to:

    yield Request(url="http://example.com/viewSomeOtherData",
                  callback=self.intermediateMethod,
                  dont_filter=True)
    

    worked for me.

    Scrapy filters duplicate requests by default, and it is likely dropping this Request: the URL is identical on every loop iteration, so only the first one is scheduled. Try adding dont_filter=True after the callback, as suggested by Steve.
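
    To illustrate why the form requests get through while the plain Request does not, here is a minimal standard-library sketch of fingerprint-based duplicate filtering. It is a simplified stand-in, not Scrapy's actual implementation: ToyScheduler, fingerprint, and enqueue are hypothetical names, and the POST method and the "id=..." form bodies are assumptions, since the question elides the form data.

```python
import hashlib


def fingerprint(method, url, body=b""):
    """Hash the parts that identify a request (simplified model)."""
    h = hashlib.sha1()
    for part in (method.encode(), url.encode(), body):
        h.update(part)
        h.update(b"\x00")  # separator so adjacent parts cannot run together
    return h.hexdigest()


class ToyScheduler:
    """Hypothetical scheduler that drops requests it has already seen."""

    def __init__(self):
        self.seen = set()
        self.queue = []

    def enqueue(self, method, url, body=b"", dont_filter=False):
        if not dont_filter:
            fp = fingerprint(method, url, body)
            if fp in self.seen:
                return False  # duplicate: silently dropped
            self.seen.add(fp)
        self.queue.append((method, url, body))
        return True


sched = ToyScheduler()
ids = ["1", "2", "3"]

# Assumed: each form submission carries a different body, so each is unique.
form_results = [
    sched.enqueue("POST", "http://example.com/viewData", body=f"id={i}".encode())
    for i in ids
]

# Same method, URL, and (empty) body every time: only the first is kept.
plain_results = [
    sched.enqueue("GET", "http://example.com/viewSomeOtherData") for _ in ids
]

# With dont_filter=True the duplicate check is skipped entirely.
unfiltered_results = [
    sched.enqueue("GET", "http://example.com/viewSomeOtherData", dont_filter=True)
    for _ in ids
]

print(form_results)        # [True, True, True]
print(plain_results)       # [True, False, False]
print(unfiltered_results)  # [True, True, True]
```

    In this model the second and third plain requests are dropped because nothing distinguishes them from the first, which matches the behaviour described in the question.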