
Bin packing parts of a dynamic set, considering last update


There's a large set of objects. The set is dynamic: objects can be added or deleted at any time. Let's call the total number of objects N.

Each object has two properties: mass (M) and time (T) of last update.

Every X minutes a small batch of those should be selected for processing, which updates their T to the current time. The total M of all objects in a batch is limited: no more than L.

I am looking to solve three tasks here:

  1. find an algorithm for picking the next batch of objects;
  2. introduce object classes: simple, priority (guaranteed a place in at least every n-th batch) and frequent (included in every batch);
  3. forecast when system capacity will be exhausted (time to add the next server, i.e. increase L).

What kind of model best describes such a system?

The whole thing is about a service that processes the "objects" at time intervals. Each object should be "measured" every few hours; that interval can vary within a range. X is fixed.

Objects are added and deleted by humans. N grows exponentially, rather slowly, with some spikes caused by publications. Of course the forecast can't be precise, just an estimate. M varies from 0 to 1E7 with an exponential distribution; most values are close to 0.

I see there can be several strategies here:

A. full throttle - pack each batch as close to 100% as possible. As N grows, the average interval between hits on a particular object will grow.

B. equal temperament :) - try to keep the average interval around some target value. The batch fill level will grow from some low starting point; when it gets close to 100%, it's time to get more servers.

C. - ?


Solution

  • Here is a pretty complete design for your problem.

    Your question does not quite match your description of the system it is for, so I'll assume that the description is accurate.

    When you schedule a measurement you should pass the object, the earliest time it can be measured, and the time you want the measurement to happen by. The object should have a weight attribute and a measured method. When the measurement happens, the measured method will be called; the difference between your classes is whether, and with what parameters, they reschedule themselves.
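    As a concrete sketch of that interface (all names here — Measurable, SimpleObject, scheduler.schedule — are illustrative assumptions, not an existing API):

```python
class Measurable:
    """Hypothetical base class for the objects described above."""

    def __init__(self, weight):
        self.weight = weight  # the object's mass M

    def measured(self, scheduler, now):
        """Called when the measurement happens; subclasses decide
        whether, and with what parameters, to reschedule."""
        raise NotImplementedError


class SimpleObject(Measurable):
    """A 'simple' object just reschedules itself for its next window."""

    def __init__(self, weight, interval_hours):
        super().__init__(weight)
        self.interval = interval_hours * 3600  # seconds

    def measured(self, scheduler, now):
        # Earliest time is one interval away; the deadline gives the
        # scheduler some slack (the 1.5 factor is an arbitrary choice).
        scheduler.schedule(self,
                           earliest=now + self.interval,
                           deadline=now + self.interval * 1.5)
```

    A "frequent" class would instead reschedule with `earliest` equal to the very next period, and a "priority" class with a deadline of at most n periods out.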

    Internally you will need a couple of priority queues. See http://en.wikipedia.org/wiki/Heap_(data_structure) for details on how to implement one.

    The first queue is ordered by the earliest time a measurement can happen, and holds all of the objects that can't be measured yet. Every time you schedule a batch you will use it to find all of the measurements that have become possible.

    The second queue holds measurements that are ready to go now, and is ordered by the scheduling period they should happen by, and then by weight. I would make both keys ascending. You can schedule a batch by pulling items off of that queue until you've got enough to send off.

    Now you need to know how much to put in each batch. Given the system you have described, a spike of objects can be added manually, but over time you'd like those spikes to smooth out. Therefore I would recommend option B, equal temperament. To do this, as you put each object into the "ready now" queue, calculate its "average work weight" as its weight divided by the number of periods until it is supposed to happen. Store that with the object, and keep a running total telling you what run rate you should be at. Every period, keep adding to the batch until one of three conditions has been met:

    1. You run out of objects.
    2. You hit your maximum batch capacity.
    3. You exceed 1.1 times your running total of your average work weight. The extra 10% is because it is better to use a bit more capacity now than to run out of capacity later.
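    The batch-building loop with those three stop conditions might look like this (the heap entry layout and parameter names are illustrative assumptions):

```python
import heapq


def build_batch(ready, max_capacity, run_rate):
    """Pull from the ready min-heap (entries: (deadline, weight, seq, obj))
    until one of the three stop conditions is met. `run_rate` is the
    running total of average work weight for this period."""
    batch, batch_weight = [], 0.0
    while ready:                                    # 1. ran out of objects
        deadline, weight, seq, obj = ready[0]
        if batch_weight + weight > max_capacity:    # 2. hit max batch capacity L
            break
        if batch_weight >= 1.1 * run_rate:          # 3. exceeded 1.1x the run rate
            break
        heapq.heappop(ready)
        batch.append(obj)
        batch_weight += weight
    return batch, batch_weight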

    And finally, capacity planning.

    For this you need a heuristic. Here is a reasonable one, which may need some tweaking for your system. Maintain an array of your past 10 measurements of the running total of average work weight, and maintain an "exponentially damped average of your high water mark." Do that by updating it each period according to the formula:

    average_high_water_mark = 0.95 * average_high_water_mark + 0.05 * max(last 10 running work weights)

    If average_high_water_mark ever gets within, say, 2 servers of your maximum capacity, then add more servers. (The idea is that a server should be able to die without leaving you hosed.)
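    A minimal sketch of that heuristic, assuming a fixed per-server capacity and the coefficients above (the class and attribute names are made up for illustration):

```python
from collections import deque


class CapacityForecaster:
    """Exponentially damped high-water mark over the last 10 periods."""

    def __init__(self, server_capacity, num_servers):
        self.recent = deque(maxlen=10)  # last 10 running work weights
        self.avg_high_water = 0.0
        self.server_capacity = server_capacity
        self.num_servers = num_servers

    def record(self, running_work_weight):
        self.recent.append(running_work_weight)
        self.avg_high_water = (0.95 * self.avg_high_water
                               + 0.05 * max(self.recent))

    def needs_more_servers(self):
        # Leave two servers of headroom so one can die without
        # leaving you over capacity.
        limit = (self.num_servers - 2) * self.server_capacity
        return self.avg_high_water >= limit
```

    Call `record` once per period and add a server whenever `needs_more_servers` turns true; the damped average filters out one-off spikes while still tracking sustained growth.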