The website I'm trying to scrape loads its content dynamically. My PlaywrightCrawler clicks the "Load" button to load more listings; the site displays 30 listings per page and there are 2,000+ listings in total. I want the scraper to click each phone-number link to reveal the number and then extract it. The script does this, but I need to increase the timeout so the request handler can stay alive indefinitely.
Error:
PlaywrightCrawler: Reclaiming failed request back to the list or queue. requestHandler timed out after 60 seconds
import { PlaywrightCrawler, Configuration, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ request, page, enqueueLinks, log }) => {
        try {
            const title = await page.title();
            log.info(`Title of ${request.loadedUrl} is '${title}'`);

            const cursor_i = 0;
            // Click the phone-number link on each of the next 10 listings.
            for (let i = cursor_i; i < cursor_i + 10; i++) {
                const btn = page.locator('#page_content div.je2-businesses-list div.je2-business-item__buttons').nth(i);
                await page.waitForTimeout(3000);
                await btn.locator('a').click();
            }
        } catch (error) {
            console.error('An error occurred:', error.message);
        } finally {
            console.log('Finally block executed.');
        }
    },
    headless: false,
});
// Here we start the crawler on the selected URLs.
await crawler.run(['https://www.jamesedition.com/offices?category=real_estate']);
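For completeness, here is a rough sketch of the extraction step the description mentions, assuming the revealed phone number replaces the anchor's text after the click – the actual markup isn't shown, so the selector and the Dataset fields are guesses:

import { Dataset } from 'crawlee';

// Hypothetical helper: click the i-th phone-number link and store whatever
// text the click reveals. The selector is copied from the snippet above;
// the assumption that the number appears as the anchor text is unverified.
async function extractPhone(page, url, i) {
    const link = page
        .locator('#page_content div.je2-businesses-list div.je2-business-item__buttons a')
        .nth(i);
    await link.click();
    const phone = (await link.textContent())?.trim();
    await Dataset.pushData({ url, phone });
}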
The Playwright request handler has a default timeout of 60 seconds; see the PlaywrightCrawlerOptions API docs. To allow a single requestHandler call to run longer than 60 seconds, you have to set the timeout to the handler's estimated run time in seconds:
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandlerTimeoutSecs: 1800, // e.g. set the requestHandler timeout to 30 minutes
    requestHandler: async ({ request, page, enqueueLinks, log }) => {
        // ...
    },
});

await crawler.run(['https://foo.bar']);
Apparently, setting requestHandlerTimeoutSecs: 0 does not create an indefinite request-handler timeout. As I understand the BasicCrawler class – the inheritance chain is PlaywrightCrawler > BrowserCrawler > BasicCrawler – a timeout of 0 ends up falling back to the default of 60 seconds. So, as far as I can tell, you have to choose a reasonably safe finite timeout for the request handler. The maximum appears to be 2147483647, the 32-bit signed integer limit.
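Putting that together, a minimal sketch under those constraints – the concrete numbers are illustrative, derived from the question's figures (2,000+ listings, a 3-second wait per click):

import { PlaywrightCrawler } from 'crawlee';

// Sketch: with ~2,000 listings and a 3-second wait per phone-number click,
// one handler run can exceed 100 minutes, so pick a generous finite timeout.
const crawler = new PlaywrightCrawler({
    requestHandlerTimeoutSecs: 3 * 60 * 60, // 3 hours; must stay below 2147483647
    requestHandler: async ({ page, log }) => {
        // ... long-running load-more / click loop ...
    },
});

await crawler.run(['https://www.jamesedition.com/offices?category=real_estate']);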