javascript architecture node.js scale beanstalkd

Database Backed Work Queue

My situation ...

I have a set of workers that are scheduled to run periodically, each at different intervals, and would like to find a good implementation to manage their execution.

Example: Let's say I have a worker that goes to the store and buys me milk once a week. I would like to store this job and it's configuration in a mysql table. But, it seems like a really bad idea to poll the table (every second?) and see which jobs are ready to be put into the execution pipeline.

All of my workers are written in javascript, so I'm using node.js for execution and beanstalkd as a pipeline.

If new jobs (ie. scheduling a worker to run at a given time) are being created asynchronously and I need to store the job result and configuration persistently, how do I avoid polling a table?

Thanks!

Solution

I agree that it seems inelegant, but given the way that computers work something *somewhere* is going to have to do polling of some kind in order to figure out which jobs to execute when. So, let's go over some of your options:

Poll the database table. This isn't a bad idea at all - it's probably the simplest option if you're storing the jobs in MySQL anyway. A rate of one query per second is nothing - give it a try and you'll notice that your system doesn't even feel it.

Some ideas to help you scale this to possibly hundreds of queries per second, or just keep system resource requirements down:
- Create a second table, 'job_pending', where you put the jobs that need to be executed within the next X seconds/minutes/hours.
- Run queries on your big table of all jobs only once in a longer while, then populate the small table which you query every shorter while.
- Remove jobs that were executed from the small table in order to keep it small.
- Use an index on your 'execute_time' (or whatever you call it) column.
If you have to scale even further, keep the main jobs table in the database, and use the second, smaller table I suggest, just put that table in RAM: either as a memory table in the DB engine, or in a Queue of some kind in your program. Query the queue at extremely short intervals if you have too - it'll take some extreme use cases to cause any performance issues here.

The main issue with this option is that you'll have to keep track of jobs that were in memory but didn't execute, e.g. due to a system crash - more coding for you...
Create a thread for each of a bunch of jobs (say, all jobs that need to execute in the next minute), and call thread.sleep(millis_until_execution_time) (or whatever, I'm not that familiar with node.js).

This option has the same problem as no. 2 - where you have to keep track job execution for crash recovery. It's also the most wasteful imo - every sleeping job thread still takes system resources.

There may be additional options of course - I hope that others answer with more ideas.

Just realize that polling the DB every second isn't a bad idea at all. It's the most straightforward way imo (remember KISS), and at this rate you shouldn't have performance issues so avoid premature optimizations.