Tags: web-scraping, cheerio, apify

Apify Cheerio Scraper stops even with URLs in the queue


Here is the scenario: I'm using the Cheerio Scraper to scrape a website containing real estate listings.

Each listing links to the next one, so before scraping the current page I add the next page to the request queue. What always happens, at some random point, is that the scraper stops for no apparent reason, even though the queue still contains the next page to scrape (see the attached image).

Why does this happen when there is still a pending request in the queue? Many thanks.

Here is the message I get:

2021-02-28T10:52:35.439Z INFO  CheerioCrawler: All the requests from request list and/or request queue have been processed, the crawler will shut down.
2021-02-28T10:52:35.672Z INFO  CheerioCrawler: Final request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":963,"requestsFinishedPerMinute":50,"requestsFailedPerMinute":0,"requestTotalDurationMillis":22143,"requestsTotal":23,"crawlerRuntimeMillis":27584,"requestsFinished":23,"requestsFailed":0,"retryHistogram":[23]}
2021-02-28T10:52:35.679Z INFO  Cheerio Scraper finished.

Here is the request queue:

[image: request queue details]

Here is the code:

async function pageFunction(context) {
    const { $, request, log } = context;

    // The "$" property contains the Cheerio object which is useful
    // for querying DOM elements and extracting data from them.
    const pageTitle = $('title').first().text();

    // The "request" property contains various information about the web page loaded. 
    const url = request.url;
    
    // Use "log" object to print information to actor log.
    log.info('Scraping Page', { url, pageTitle });

    // Adding next page to the queue
    const baseUrl = '...';
    if ($('div.d3-detailpager__element--next a').length > 0)
    {
        const nextPageUrl = $('div.d3-detailpager__element--next a').attr('href');
        log.info('Found another page', { nextUrl: baseUrl.concat(nextPageUrl) });
        context.enqueueRequest({ url: baseUrl.concat(nextPageUrl) });
    }
    
    // My code for scraping follows here
    
    return { /* my scraped object */ };
 }


Solution

  • Missing await

    context.enqueueRequest returns a promise, and pageFunction never awaits it. The function can therefore return before the next request actually lands in the queue; the crawler sees no pending requests, concludes all work is done, and shuts down. Awaiting the call removes the race:

    await context.enqueueRequest({ url: baseUrl.concat(nextPageUrl) });
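
For reference, here is the enqueueing block from the pageFunction above with the fix applied. This is a minimal sketch: the baseUrl value stays as the '...' placeholder from the question and must be replaced with the site's real base URL.

    // Adding next page to the queue
    const baseUrl = '...'; // placeholder from the question; fill in the real base URL
    if ($('div.d3-detailpager__element--next a').length > 0)
    {
        const nextPageUrl = $('div.d3-detailpager__element--next a').attr('href');
        log.info('Found another page', { nextUrl: baseUrl.concat(nextPageUrl) });
        // Await so the request is in the queue before pageFunction returns;
        // otherwise the crawler may find the queue empty and shut down early.
        await context.enqueueRequest({ url: baseUrl.concat(nextPageUrl) });
    }

Note that pageFunction is already declared async, so adding await here is all that's needed.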