javascript · playwright · crawlee

PlaywrightCrawler: Reclaiming failed request back to the list or queue. requestHandler timed out after 60 seconds


The website I'm trying to scrape loads its content dynamically: the crawler has to click a "Load" button to load more listings, 30 per page, and there are 2,000+ listings in total. I want the scraper to click each phone-number link to reveal the number and then extract it. The script does this, but I need to increase the timeout so it can keep running indefinitely.

Error:

PlaywrightCrawler: Reclaiming failed request back to the list or queue. requestHandler timed out after 60 seconds

    import { PlaywrightCrawler } from 'crawlee';

    const crawler = new PlaywrightCrawler({
        requestHandler: async ({ request, page, log }) => {
            try {
                const title = await page.title();
                log.info(`Title of ${request.loadedUrl} is '${title}'`);

                // Click the phone-number link in the first 10 listing items.
                const cursor = 0;
                for (let i = cursor; i < cursor + 10; i++) {
                    const item = page.locator('#page_content div.je2-businesses-list div.je2-business-item__buttons').nth(i);
                    await page.waitForTimeout(3000);
                    await item.locator('a').click();
                }
            } catch (error) {
                console.error('An error occurred:', error.message);
            } finally {
                console.log('Finally block executed.');
            }
        },
        headless: false,
    });

    // Here we start the crawler on the selected URL.
    await crawler.run(['https://www.jamesedition.com/offices?category=real_estate']);

Solution

  • The Playwright request handler has a default timeout of 60 seconds – see the PlaywrightCrawlerOptions API Docs here.

    To allow a single requestHandler call to run longer than 60 seconds, set requestHandlerTimeoutSecs to your estimated handler run time in seconds.

    import { PlaywrightCrawler, Configuration, Dataset } from 'crawlee';
    
    const crawler = new PlaywrightCrawler({
        requestHandlerTimeoutSecs: 1800, // e.g. set requestHandler timeout to 30 minutes
        requestHandler: async ({ request, page, enqueueLinks, log }) => {
            // ...
        },
    })
    
    await crawler.run(['https://foo.bar']);
    

    Apparently, setting requestHandlerTimeoutSecs: 0 does not create an indefinite request-handler timeout. In my understanding of the BasicCrawler class – inheritance chain PlaywrightCrawler > BrowserCrawler > BasicCrawler – a timeout of 0 falls back to the default of 60 seconds. So, as far as I understand it, you have to choose a reasonably safe finite timeout for the request handler. The maximum, however, seems to be 2147483647. See here and here.
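    A plausible explanation for that exact ceiling (an assumption on my part, not something stated in the Crawlee docs): 2147483647 is the largest 32-bit signed integer, which is also the maximum delay Node.js timers such as setTimeout accept in milliseconds; larger delays overflow and fire almost immediately. A quick sketch of the arithmetic:

    ```javascript
    // 2147483647 = 2^31 - 1, the largest 32-bit signed integer.
    // Node.js timers cap their delay at this many milliseconds;
    // anything larger overflows and fires on the next tick.
    const MAX_32BIT_INT = 2 ** 31 - 1;
    console.log(MAX_32BIT_INT);                              // 2147483647
    // Interpreted as milliseconds, that delay is roughly 24.9 days:
    console.log((MAX_32BIT_INT / 1000 / 86400).toFixed(1));  // '24.9'
    ```

    So however Crawlee interprets the value internally, a timeout in the billions of seconds is far beyond any practical crawl anyway; picking a generous but finite value like a few hours is the safer route.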