javascript, node.js, web-scraping, apify

Apify - How to Include Failed Results in Dataset


We are using the Apify Web Scraper actor to create a URL validation task that returns the input URL, the page's title, and the HTTP response status code. We are testing with a set of 5 URLs: 4 valid and 1 non-existent. The successful results are always included in the dataset, but the failed URL never is.

Logging indicates that the pageFunction is not even reached for the failed URL:

2021-05-05T14:50:08.489Z ERROR PuppeteerCrawler: handleRequestFunction failed, reclaiming failed request back to the list or queue {"url":"http://www.invalidurl.com","retryCount":1,"id":"XS9JTk8dYRM8bpM"}
2021-05-05T14:50:08.490Z   Error: gotoFunction timed out after 30 seconds.
2021-05-05T14:50:08.490Z       at PuppeteerCrawler._handleRequestTimeout (/home/myuser/node_modules/apify/build/crawlers/puppeteer_crawler.js:387:15)
2021-05-05T14:50:08.496Z       at PuppeteerCrawler._handleRequestFunction (/home/myuser/node_modules/apify/build/crawlers/puppeteer_crawler.js:329:26)

Eventually, after exhausting the retries we configured, the request is marked as failed:

2021-05-05T14:50:42.052Z ERROR Request http://www.invalidurl.com failed and will not be retried anymore. Marking as failed.
2021-05-05T14:50:42.052Z Last Error Message: Error: gotoFunction timed out after 30 seconds.

I tried wrapping the code in the pageFunction in a try/catch block, but since the pageFunction is never reached for the invalid URL, that does not resolve the issue. Is there a way to still include the failed result in the dataset with a hard-coded response status code of '000'? (See pageFunction code below.) Please let me know if I can provide any additional information, and thanks in advance!

async function pageFunction(context) {
    context.log.info("Starting pageFunction");
    // use jQuery as $
    const { request, jQuery: $ } = context;
    const { url } = request;
    context.log.info("Trying " + url);
    let title = null;
    let responseCode = null;

    try {
        context.log.info("In try block for " + url);
        title = $('title').first().text().trim();
        responseCode = context.response.status;
    } catch (error) {
        context.log.info("EXCEPTION for " + url);
        title = "";
        responseCode = "000";
    }

    return {
        url,
        title,
        responseCode
    };
}

Solution

  • You can use handleFailedRequestFunction:
    https://sdk.apify.com/docs/typedefs/puppeteer-crawler-options#handlefailedrequestfunction

    You can then push the failed request to the dataset once all of its retries have failed:

    handleFailedRequestFunction: async ({ request }) => {
        // failed all retries
        await Apify.pushData({ url: request.url, responseCode: '000' });
    }