node.js multithreading worker-process

What is the best way to pass large amounts of data to a worker process?


I am working on a project that requires very CPU-intensive work on a web server. I have started using worker threads to do the work, in order to utilize multiple cores and avoid blocking requests while the work is done. The issue is that it involves indexing and comparing indices on large amounts of data, possibly several MB. I can currently think of 3 ways to do this:

  1. Use worker messages to pass all data. From what I know, the data is cloned before it is passed to the worker, which is rather expensive. Plus, the data would have to be converted first, because it is a Mongoose document object.

  2. Since all of the data is in the database, I could just fetch it from there inside the worker. However, as far as I know, this means that every worker has to open its own new connection to the database.

  3. Write the data to a file, then read it back from the worker. This seems like the worst option because all of the data would have to be written and then read again every time. Plus, I would have to make sure that each worker gets a unique file so that it doesn't read a file that belongs to another worker.

The work I am doing is indexing, then comparing indices on large numbers of files. It often takes several seconds.

Which of these is going to be the best way to pass the index data to the worker? Or is there another way to do this that I am not thinking of?


Solution

  • The best solution that I found is using Redis. For those who don't know (as I didn't), Redis is an in-memory database that is really easy to set up and use. Like localStorage in the browser, it is a key-value store that holds simple values such as strings rather than JavaScript objects, so I just used JSON.stringify/JSON.parse to pass the data as a string. Since the data sits in memory on the Redis server, it can be saved by the main process and then read by the worker. Here is an example:

    index.js:

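    // node-redis v4+ API; assumes this runs inside an async function.
    const {createClient} = require("redis");
    const client = createClient();
    await client.connect();
    // Redis stores strings, so serialize the object first.
    await client.set("myData", JSON.stringify(largeObject));
    await client.disconnect();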
    

    worker.js:

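    // node-redis v4+ API; assumes this runs inside an async function.
    const {createClient} = require("redis");
    const client = createClient();
    await client.connect();
    // Read the string back and rebuild the object.
    let myData = JSON.parse(await client.get("myData"));
    await client.disconnect();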
    

    This works and is actually surprisingly fast. I tried it with an object that had several thousand properties (mostly containing numbers), and stringifying and saving it took anywhere from 5 to 20 ms. Reading and parsing on the worker end ran at similar speeds.

    This is probably not the optimal solution, but it does work quite well. I think that the best approach would be shared memory with SharedArrayBuffer, so that the worker can just access the memory directly. However, Redis seemed simpler and was the best option for me.

    Honestly, the lesson that I learned from this is that JavaScript is probably not the best language for something like this. This is a task much better suited to something like C++ or Rust, where you can use pointers.