node.js, express, heroku, concurrency, process

Understanding Node.js concurrent process and variable scope


I have an Express.js app that I am now adding Node cluster support to with Throng. I haven't managed to wrap my head entirely around how processes should work with clusters. I'm specifically confused about what should be included in my start function.

For example:

// Setup Bull producers
const Bull = require('bull');
const urgentQueue = new Bull('urgent-queue', REDIS_URL);
const normalQueue = new Bull('normal-queue', REDIS_URL);

// Express app using Slack Bolt framework
const {App, ExpressReceiver} = require('@slack/bolt');
const expressReceiver = new ExpressReceiver({
    signingSecret: process.env.SLACK_SIGNING_SECRET,
    endpoints: '/slack/events'
});

const app = new App({
    authorize: helpers.authorize,
    receiver: expressReceiver
});

// Other functions go here, redacted for example


// Schedule our repeating Bull jobs in our master process
// Bull's signature is add(name, data, opts), so the repeat settings go in the third argument
const startMaster = async () => {
    await normalQueue.add('update-presence', {}, {repeat: {cron: '0,15,30,45 * * * *'}});
    await urgentQueue.add('nightly-billing', {}, {repeat: {cron: '0 0 * * *'}});
    await urgentQueue.add('process-trials', {}, {repeat: {cron: '30 0 * * *'}});
    console.log('⚡️ Master process has run!');
};

// Our workers can spawn multiple apps
// Question: should all of this go inside the start function (including the Express definitions above)?
const startWorker = async () => {
    await app.start(process.env.PORT || 5000);
    console.log('⚡️ Visual Office app is running!');
};

// Launch with concurrency support
const throng = require('throng');
const WORKERS = process.env.WEB_CONCURRENCY || 1;
throng({
    workers: WORKERS,
    lifetime: Infinity,
    master: startMaster,
    start: startWorker
});
console.log(`launched with ${WORKERS} concurrent processes`);

My question essentially boils down to variable scoping. The example code above references constants that are defined outside of the master and start functions. Most examples I've seen build the complete Express app inside the start function, rather than simply calling app.start inside the worker's start function.

Is this because referencing it this way defeats the purpose, in that the worker processes would be referencing the same object in memory despite being separate processes?


Solution

  • Your example uses the throng module. Looking at its code, it uses the Node.js cluster module. I can offer some explanation both in terms of the built-in cluster module (which should apply to throng) and in terms of the built-in worker_threads module.

    cluster module

    The cluster module starts completely separate processes. They don't share any memory or any variables, though they are set up so that they can send socket handles from one process to another and can send each other values, which are copied. If you define constants in your cluster startup code, each worker process will have its own identical copy of those constants, simply because every process runs the same startup code and so initializes itself the same way. But if you have a global or module-level variable that you intend to change, any change you make is entirely local to that process and is not reflected in the others.
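
    As a rough illustration of the point above, here is a minimal sketch using the built-in cluster module directly (not throng). It assumes Node.js 16 or later for cluster.isPrimary, and the INCREMENTS environment variable is a made-up name used only to give each worker a different amount of work. Every process loads the same file, so every process gets its own counter.

```javascript
const cluster = require('node:cluster');

// Module-level state. Each process that loads this file gets its own copy.
let counter = 0;

// Primary: fork the workers and collect one report from each.
function runPrimary(workerCount) {
  return new Promise((resolve) => {
    const reports = [];
    for (let i = 1; i <= workerCount; i++) {
      // INCREMENTS is a hypothetical env variable used only by this sketch.
      const worker = cluster.fork({ INCREMENTS: String(i) });
      worker.on('message', (report) => {
        reports.push(report);
        worker.disconnect(); // let the worker exit once it has reported
        if (reports.length === workerCount) {
          resolve({ primaryCounter: counter, reports });
        }
      });
    }
  });
}

// Worker: mutate the module-level variable, then report the local value.
function runWorker() {
  const increments = Number(process.env.INCREMENTS);
  for (let i = 0; i < increments; i++) counter += 1;
  process.send({ pid: process.pid, counter });
}

// In workers `demo` stays null; only the primary collects results.
const demo = cluster.isPrimary ? runPrimary(2) : null;
if (demo) {
  demo.then((result) => console.log(result));
} else {
  runWorker();
}
```

    The two workers report counters of 1 and 2, while the primary's own counter is still 0. The same reasoning applies to objects such as the Express app or the Bull queues in the question: each process builds its own instance when it runs the module.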

    worker_threads module

    The worker_threads module lets you start one or more threads in the same process. Each thread runs its own V8 instance and, by default, they don't share any variables or memory (global or module-level). Values that you pass between the main thread and a worker thread, either via workerData or via postMessage(), are copied into separate values in the receiving thread. Objects are copied with the structured clone algorithm.

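    But they do run in the same process, so they share a process ID and a fatal crash in one thread (for example, in a native addon) brings down all of them. Note that calling process.exit() inside a worker thread stops only that thread, not the whole process, and the main thread can stop a worker with worker.terminate().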
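
    The copy semantics of workerData and postMessage() described above can be sketched as follows. The worker source is kept inline (with the eval option) only so that the example fits in one file, and the visits property is an arbitrary name chosen for illustration.

```javascript
const { Worker } = require('node:worker_threads');

// Worker source kept inline (eval: true) only so the sketch is a single file.
const workerSource = `
  const { parentPort, workerData } = require('node:worker_threads');
  workerData.visits.push('worker');    // mutates the worker's own copy
  parentPort.postMessage(workerData);  // cloned again on the way back
`;

function runCopyDemo() {
  const original = { visits: ['main'] };
  return new Promise((resolve, reject) => {
    const worker = new Worker(workerSource, { eval: true, workerData: original });
    worker.once('message', (received) => resolve({ original, received }));
    worker.once('error', reject);
  });
}

const copyDemo = runCopyDemo();
copyDemo.then(({ original, received }) => {
  console.log(original.visits); // [ 'main' ]: the main thread's object is untouched
  console.log(received.visits); // [ 'main', 'worker' ]: the worker's modified copy
});
```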

    Worker threads can share memory by allocating a SharedArrayBuffer, which the main thread and any worker that has been given a reference can access. Access to that memory is not synchronized automatically, so the usual multi-threading concurrency issues (data races and lost updates) can occur. To avoid them, either use the synchronization primitives that Node.js offers (the Atomics object) or design the code so that only one thread at a time holds a reference to the shared memory. That's what I've been doing in my recent app: the main thread passes a SharedArrayBuffer to a specific worker to carry out an operation, the worker passes it back when the job is done, and the main thread does not keep a reference to it while the worker is working on it.
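
    A minimal sketch of the first option, assuming a single shared 32-bit counter that several workers increment with Atomics.add(). The function name runSharedCounter and the worker count are illustrative choices, not anything from the question.

```javascript
const { Worker } = require('node:worker_threads');

// Inline worker source (eval: true) so the sketch is self-contained.
const workerSource = `
  const { workerData } = require('node:worker_threads');
  const counter = new Int32Array(workerData.shared);
  for (let i = 0; i < workerData.increments; i++) {
    Atomics.add(counter, 0, 1); // atomic read-modify-write on slot 0
  }
`;

function runSharedCounter(workerCount, increments) {
  // A SharedArrayBuffer in workerData is shared with the worker, not copied.
  const shared = new SharedArrayBuffer(Int32Array.BYTES_PER_ELEMENT);
  const counter = new Int32Array(shared);
  const finished = [];
  for (let i = 0; i < workerCount; i++) {
    finished.push(new Promise((resolve, reject) => {
      const worker = new Worker(workerSource, {
        eval: true,
        workerData: { shared, increments },
      });
      worker.once('error', reject);
      worker.once('exit', resolve);
    }));
  }
  // Read the total only after every worker has exited.
  return Promise.all(finished).then(() => Atomics.load(counter, 0));
}

const sharedTotal = runSharedCounter(4, 10000);
sharedTotal.then((total) => console.log(total)); // 40000
```

    With a plain `counter[0]++` instead of Atomics.add(), concurrent increments could overwrite each other and the total could come out lower.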

    Memory shared in this way isn't a set of variables; it's a raw buffer of memory. But you can interpret that memory however you want (for example, through typed array views) to turn it into specific, meaningful variables.
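
    One way to picture this is to lay typed array views over fixed byte offsets in the buffer. The layout below (a job count at byte 0 and an average at byte 8) is purely hypothetical.

```javascript
// Hypothetical layout: bytes 0-3 hold a job count, bytes 8-15 hold an average.
const shared = new SharedArrayBuffer(16);
const jobCount = new Int32Array(shared, 0, 1);  // one 32-bit integer at byte offset 0
const average = new Float64Array(shared, 8, 1); // one 64-bit float at byte offset 8

jobCount[0] = 3;
average[0] = 12.5;

// A second set of views over the same buffer (as another thread would create)
// reads the same underlying memory.
const jobCountAgain = new Int32Array(shared, 0, 1);
const averageAgain = new Float64Array(shared, 8, 1);
console.log(jobCountAgain[0], averageAgain[0]); // 3 12.5
```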

    Note that shared memory isn't automatically available to every thread. A thread gets a reference to it only when you pass it, either in the initial workerData or later with postMessage(). A SharedArrayBuffer is literally allocated from a different heap. I believe it was first developed by V8 for use with Web Workers in the browser, but it now works in Node.js too.
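
    The hand-off design mentioned earlier (pass the buffer to one worker, let it work, get it back) might look roughly like this when the buffer is sent with postMessage() after the worker has started. The message shape `{ buffer }` and the fill value are assumptions made for the sketch.

```javascript
const { Worker } = require('node:worker_threads');

// Inline worker source (eval: true) so the sketch is self-contained.
const workerSource = `
  const { parentPort } = require('node:worker_threads');
  parentPort.once('message', ({ buffer }) => {
    new Uint8Array(buffer).fill(7);     // do the work in place
    parentPort.postMessage({ buffer }); // hand the buffer back when done
  });
`;

function runHandOff() {
  return new Promise((resolve, reject) => {
    const worker = new Worker(workerSource, { eval: true });
    let buffer = new SharedArrayBuffer(4);
    worker.once('error', reject);
    worker.once('message', (reply) => {
      // The reply refers to the same memory the main thread allocated.
      resolve(Array.from(new Uint8Array(reply.buffer)));
    });
    worker.postMessage({ buffer });
    buffer = null; // by convention, drop our reference while the worker owns it
  });
}

const handOff = runHandOff();
handOff.then((bytes) => console.log(bytes)); // [ 7, 7, 7, 7 ]
```

    Nothing enforces the single-owner rule here; it is a convention that keeps the two threads from touching the memory at the same time.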