python-3.x · scrapy · web-crawler · playwright-python · scrapy-playwright

Scrapy callback not executed when using Playwright for JavaScript rendering


I'm using Scrapy with the Playwright plugin to crawl a website that relies on JavaScript for rendering. My spider includes two asynchronous functions, parse_categories and parse_product_page.

The parse_categories function checks for categories in the URL; while categories are found, it keeps yielding requests back to the parse_categories callback. A product page is reached once no categories remain, at which point it should yield a request to the parse_product_page callback.

However, when it reaches the else block in parse_categories, it seems that the request to parse_product_page is never made. I've confirmed that the code enters the else block, but the print statement in the parse_product_page function is never reached.

Here is my reprex (the callbacks are named parse_urls and parse here):

import scrapy
from scrapy_playwright.page import PageMethod

class Spider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ['quotes.toscrape.com']
  
    def start_requests(self):
        yield scrapy.Request(
            url='https://quotes.toscrape.com/js/',
            callback=self.parse_urls,
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_page_methods=[
                    PageMethod('wait_for_selector', 'body > div > nav > ul > li > a'),
                ],
            ),
        )
    

    async def parse_urls(self, response):
        page = response.meta['playwright_page']
        await page.close()
        
        next_page_url = response.xpath('//li[@class="next"]/a/@href').get()

        if next_page_url:
            print("Inside if block")
            url = 'https://quotes.toscrape.com' + next_page_url
            yield scrapy.Request(
                url=url,
                callback=self.parse_urls,
                meta=dict(
                    playwright=True,
                    playwright_include_page=True,
                    playwright_page_methods=[
                        PageMethod('wait_for_selector', 'body > div > div.quote'),
                    ],
                ),
            )
        else:
            print("Next page link not found")
            yield scrapy.Request(
                url=response.request.url,
                callback=self.parse,
                meta=dict(
                    playwright=True,
                    playwright_include_page=True,
                    playwright_page_methods=[
                        PageMethod('wait_for_selector', 'body > div > div.quote'),
                    ],
                ),
            )


    async def parse(self,response):
        page = response.meta['playwright_page']
        await page.close()
        print("Function has been called, because next page link not found")

These are the logs from the reprex:

Inside if block
Inside if block
Inside if block
Inside if block
Inside if block
Inside if block
Inside if block
Inside if block
Inside if block
Next page link not found
2023-04-11 09:47:04 [root] WARNING: spider quotes finished crawling

Solution

  • This issue was fixed by adding dont_filter=True to the scrapy.Request yielded in the else block. That request targets response.request.url, a URL the spider has already visited, so Scrapy's duplicate-request filter silently drops it and the parse callback is never scheduled; dont_filter=True tells the scheduler to skip that check.

    else:
        yield scrapy.Request(
            url=response.request.url,
            callback=self.parse,
            dont_filter=True,
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_page_methods=[
                    PageMethod('wait_for_selector', 'body > div > div.quote'),
                ],
            ),
        )
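For intuition, Scrapy's default dupefilter keeps a set of request fingerprints and drops any request whose fingerprint it has already seen, unless the request carries dont_filter=True. A minimal stand-alone sketch of that behavior (the class and method names here are illustrative, not Scrapy's actual implementation, which fingerprints more than the bare URL):

```python
# Illustrative sketch of duplicate-request filtering, NOT Scrapy's real code.
class ToyDupeFilter:
    def __init__(self):
        self.seen = set()  # fingerprints (here: plain URLs) already scheduled

    def allow(self, url, dont_filter=False):
        """Return True if the request should be scheduled."""
        if dont_filter:
            return True   # bypass the filter, like dont_filter=True
        if url in self.seen:
            return False  # duplicate: silently dropped, no callback runs
        self.seen.add(url)
        return True

f = ToyDupeFilter()
print(f.allow("https://quotes.toscrape.com/js/page/10/"))                    # True
print(f.allow("https://quotes.toscrape.com/js/page/10/"))                    # False
print(f.allow("https://quotes.toscrape.com/js/page/10/", dont_filter=True))  # True
```

The second call mirrors the bug above: re-requesting response.request.url returns False (dropped without any error), which is why the log ends with the spider simply finishing. The third call mirrors the fix.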