javascript, web-scraping, puppeteer, google-chrome-headless

Crawling multiple URLs in a loop using Puppeteer


I have an array of URLs to scrape data from:

urls = ['url','url','url'...]

This is what I'm doing:

urls.map(async (url)=>{
  await page.goto(url);
  await page.waitForNavigation({ waitUntil: 'networkidle' });
})

This doesn't seem to wait for each page to load and instead visits all the URLs in rapid succession (I even tried using page.waitFor).

I wanted to know whether I'm doing something fundamentally wrong, or whether this kind of usage isn't advised/supported.


Solution

  • map, forEach, reduce, etc. do not wait for the asynchronous operation inside their callback before moving on to the next element of the array they iterate over.

    There are multiple ways of stepping through the items sequentially while performing an asynchronous operation on each, but the easiest in this case is a plain for loop, which does wait for each awaited operation to finish. Note that page.goto already waits for the navigation to settle according to its waitUntil option, so a separate page.waitForNavigation call right after it would just hang waiting for a navigation that never happens:

    const urls = [...]
    
    for (const url of urls) {
        await page.goto(url, { waitUntil: 'networkidle2' });
    }
    

    This visits one URL after another, as you expect. If you are curious about iterating serially with async/await, have a peek at this answer: https://stackoverflow.com/a/24586168/791691
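
    For completeness, here is a minimal, self-contained sketch of how that loop fits into a full script. The URLs and the extraction step (page.title() here) are placeholders for whatever data you actually want to scrape:

    const puppeteer = require('puppeteer');
    
    // placeholder URLs; substitute your own list
    const urls = ['https://example.com/a', 'https://example.com/b'];
    
    (async () => {
        const browser = await puppeteer.launch();
        const page = await browser.newPage();
    
        for (const url of urls) {
            // goto resolves once the page has reached the given waitUntil
            // state, so no separate waitForNavigation call is needed
            await page.goto(url, { waitUntil: 'networkidle2' });
    
            // placeholder extraction step: grab the page title
            const title = await page.title();
            console.log(`${url}: ${title}`);
        }
    
        await browser.close();
    })();

    Reusing a single page and visiting the URLs one at a time keeps memory usage predictable; if you later need more throughput, you can open one page per URL and run a bounded number of them concurrently instead.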