node.js · puppeteer · job-scheduling

Puppeteer Chromium instance management


So I have seen the puppeteer-cluster package, but its examples are very manual and my situation is quite dynamic, so I'll try my best to explain.

OK, so I have an app in which users schedule posts. Once the posting time arrives, Puppeteer runs, goes to the site, logs the user in using credentials from my app's DB, and posts the content. Fairly simple.

Now the problem arises when, say, 20 users all decide to post today at 1 PM. Puppeteer then spawns 20 Chromium instances, which messes with the server because of limited RAM. What I am basically asking is how I can achieve the following: 1) Limit Puppeteer's concurrency to 10 instances. If there are more jobs than that, it should process them in batches: run 10, close them, start the next 10, and so on. 2) If there are fewer than 10, just keep the normal behaviour.

I know this seems like I'm giving you homework, but trust me, I just need some guidance. A little help or a pointer in the right direction would suffice, or you could tell me how to use puppeteer-cluster dynamically to suit my needs. Many thanks!


Solution

    1. First of all, you need a message queueing system such as Kafka or RabbitMQ to capture all the incoming concurrent requests.
    2. Pull the messages in chunks of 10 requests, loop over the chunks, and create one cluster per chunk.
    3. The following code shows how to accomplish this; it addresses both of your questions.

    Code snippet:

    const { Cluster } = require('puppeteer-cluster');

    const runChunks = async (chunkArr, chunkSize) => {
        // Launch one cluster per chunk
        const cluster = await Cluster.launch({
            concurrency: Cluster.CONCURRENCY_CONTEXT,
            maxConcurrency: chunkSize, // at most `chunkSize` pages at once
        });
        // Task that each queued URL will run
        await cluster.task(async ({ page, data: url }) => {
            await page.goto(url);
            console.log('Reached: ', url);
            // Here goes the code for the task to complete (log in, post, etc.) ...
        });
        // Queue the chunk's URLs for processing
        chunkArr.forEach(data => {
            cluster.queue(data.url);
        });
        // Wait until the queue is drained, then close the cluster
        await cluster.idle();
        await cluster.close();
    };
    
    function chunkArrGenerator(arr, chunkSize) {
        const chunksArr = [];
        // splice() mutates arr, so loop until it is empty; a separate loop
        // counter would terminate early and silently drop items
        while (arr.length > 0) {
            chunksArr.push(arr.splice(0, chunkSize));
        }
        return chunksArr;
    }
    
    // assume a request array of objects with url data
    let arr = [{ url: "https://www.amazon.in/" }, { url: "https://www.flipkart.com/" }, { url: "https://www.crateandbarrel.com/" }, { url: "https://www.cb2.com/" } /* so on ... */];

    let size = 2; // chunk size; change it to 10 as per your need
    let chunks = chunkArrGenerator(arr, size);
    // Run the clusters sequentially. chunks.forEach(async ...) would NOT wait
    // between chunks -- it would launch every cluster at once, defeating the
    // batching -- so use an ordinary for...of loop with await instead.
    (async () => {
        for (const chunk of chunks) {
            await runChunks(chunk, size);
        }
    })();