Tags: javascript, puppeteer, apify

Best way to push one more scrape after all are done


I have the following scenario:

  • My scrapes are behind a login, so there is one login page that I always need to hit first
  • Then I have a list of 30 URLs that can be scraped asynchronously, for all I care
  • Then, at the very end, when all 30 of those URLs have been scraped, I need to hit one last, separate URL to put the results of the 30-URL scrape into a Firebase DB and to do some other mutations (like geo lookups for addresses, etc.); there's a sketch of this flow right after this list
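
Roughly, the flow I'm after looks like this. This is just a sketch: loginUrl and urls are placeholders for my real input, not actual code from my actor.

const Apify = require('apify');

Apify.main(async () => {
    const loginUrl = 'https://example.com/login'; // placeholder
    const urls = []; // placeholder for the 30 scrape targets

    const queue = await Apify.openRequestQueue();

    // 1. The login page always has to be hit first
    await queue.addRequest(new Apify.Request({ url: loginUrl }));

    // 2. The 30 URLs; order among these doesn't matter
    for (const url of urls) {
        await queue.addRequest(new Apify.Request({ url }));
    }

    // 3. Only once all of the above are handled should the final
    //    "save to Firebase" URL be enqueued. This is the part I'm stuck on.
});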

Currently I have all 30 URLs in a request queue (added through the Apify web interface), and I'm trying to detect when they have all finished.

But since they obviously all run asynchronously, that data is never reliable:

 const queue = await Apify.openRequestQueue();
 let { pendingRequestCount } = await queue.getInfo();
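
For reference, getInfo() resolves to a stats object rather than a number, so the counts have to be read from its fields (pendingRequestCount, handledRequestCount, totalRequestCount, as far as I can tell from the SDK docs):

 const info = await queue.getInfo();
 console.log(info.pendingRequestCount, info.handledRequestCount, info.totalRequestCount);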

The reasons why I need that last URL to be separate are twofold:

  1. The most obvious reason is that I need to be sure I have the results of all 30 scrapes before I send everything to the DB
  2. None of the 30 URLs allows me to do Ajax/Fetch calls, which I need for sending to Firebase and doing the geo lookups of addresses (a sketch of that call follows below)
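
For what it's worth, the call that final page needs to be able to make is just a plain Fetch to Firebase's REST endpoint, from inside the page function for that final URL. A minimal sketch, assuming the Realtime Database REST API (the my-project.firebaseio.com URL and the payload shape are placeholders):

const results = { vietnam: { title: "Vietnam" } }; // placeholder payload

// Assumes we are on the final page, which (unlike the 30 scrape
// targets) allows Fetch calls from the browser context
await page.evaluate(async (data) => {
    // POST the collected results to the Firebase REST endpoint
    await fetch("https://my-project.firebaseio.com/scrapes.json", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify(data),
    });
}, results);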

Edit: I tried this based on the answer from @Lukáš Křivka. handledRequestCount in the while loop reaches a max of 2, never 4 ... and Puppeteer just ends normally. I've put the "return" inside the while loop because otherwise the requests never finish (of course).

In my current test setup I have 4 URLs to be scraped (in the Start URLs input field of Puppeteer Scraper on Apify.com) and this code:

let title = "";
const queue = await Apify.openRequestQueue();
let { handledRequestCount } = await queue.getInfo();
while (handledRequestCount < 4) {
    await new Promise((resolve) => setTimeout(resolve, 2000)); // wait for 2 secs
    handledRequestCount = await queue.getInfo().then((info) => info.handledRequestCount);
    console.log(`Currently handled here: ${handledRequestCount} --- waiting`); // this goes to a max of '2'
    title = await page.evaluate(() => $('h1').text());
    return { title };
}
log.info("Here I want to add another URL to the queue where I can do Ajax stuff to save the results from the above runs to the Firebase DB");
title = await page.evaluate(() => $('h1').text());
return { title };

Solution

  • Because I was not able to get consistent results with the {handledRequestCount} from getInfo() (see the edit in my original question), I went another route.

    I'm basically keeping a record of which URLs have already been scraped via the key/value store.

     const urls = [
       { done: false, label: "vietnam", url: "https://en.wikipedia.org/wiki/Vietnam" },
       { done: false, label: "cambodia", url: "https://en.wikipedia.org/wiki/Cambodia" }
     ];

     // Loop over the array and add each URL to the queue
     for (let i = 0; i < urls.length; i++) {
       await queue.addRequest(new Apify.Request({ url: urls[i].url }));
     }

     // Push the array to the key/value store under the key 'URLS'
     await Apify.setValue('URLS', urls);
    

    Now, every time I've processed a URL, I set its "done" value to true. When they are all true, I push another (final) URL into the queue:

     await queue.addRequest(new Apify.Request({ url: "http://www.placekitten.com" }));
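
    Put together, the bookkeeping inside the page function looks roughly like this. A sketch only: the pageFunction shape is illustrative, not my verbatim actor code.

     async function pageFunction({ request, page }) {
       const queue = await Apify.openRequestQueue();

       // Load the bookkeeping array and mark the current URL as done
       const urls = await Apify.getValue('URLS');
       for (const entry of urls) {
         if (entry.url === request.url) entry.done = true;
       }
       await Apify.setValue('URLS', urls);

       // Once every entry is done, enqueue the one final URL
       if (urls.every((entry) => entry.done)) {
         await queue.addRequest(new Apify.Request({ url: "http://www.placekitten.com" }));
       }
     }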