
Efficiently removing empty spaces within a set of numbers


I'll use Python syntax and objects to represent the problem, but in reality it's for a model stored in an SQL database, accessed through a Python API and ORM.

I have a list of numbers like this:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

From time to time, some numbers get removed and null spaces remain:

[0, 1, 2, None, None, 5, 6, None, None, None, 10]

What I need to do, in a maintenance step run periodically, is to pack this set of numbers efficiently, both in an ordered and in an unordered fashion, so that no null spaces remain between numbers.

So, in the ordered case, I need that list to become:

[0, 1, 2, 5, 6, 10, None, None, None, None, None]

And in the unordered case it doesn't really matter where each number ends up, as long as there are no null spaces between them.

Numbers can be moved in contiguous blocks, and moving a block any number of places to the left or right costs the same; however, each update has a setup and teardown cost, which makes it much more efficient to move larger blocks and to pack the set in as few updates as possible.

Right now I'm using the simplest solution: finding blocks of contiguous numbers and moving each one left against the previous block, one block at a time, until everything is packed (a sketch of this appears after the example below). So, in the example, the block 5, 6 is moved 2 places to the left in a single update, and then 10 is moved 5 places to the left in another update.

[0, 1, 2, None, None, 5, 6, None, None, None, 10]

[0, 1, 2, 5, 6, None, None, None, None, None, 10]

[0, 1, 2, 5, 6, 10, None, None, None, None, None]
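
For reference, here is a minimal sketch of that trivial left-packing approach in plain Python over a list, ignoring the SQL/ORM layer (the function name and the returned update count are my own choices):

def pack_ordered(cells):
    """Slide each contiguous block of numbers left against the packed prefix.

    Returns the packed list and the number of block moves (updates) made.
    In the real system each move would correspond to one SQL update.
    """
    cells = list(cells)
    moves = 0
    write = 0  # first position not yet known to be packed
    read = 0
    n = len(cells)
    while read < n:
        if cells[read] is None:
            read += 1
            continue
        start = read                      # start of a contiguous block
        while read < n and cells[read] is not None:
            read += 1
        block = cells[start:read]
        if start != write:                # block needs to move left
            cells[write:write + len(block)] = block
            for i in range(write + len(block), read):
                cells[i] = None           # clear cells behind the moved block
            moves += 1
        write += len(block)
    return cells, moves

packed, updates = pack_ordered([0, 1, 2, None, None, 5, 6, None, None, None, 10])
# packed  -> [0, 1, 2, 5, 6, 10, None, None, None, None, None]
# updates -> 2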

This trivial approach seems to be the most efficient when order matters, but in reality most of my operations will be unordered and I think there should be a better approach. For instance, in this case, the list can be packed in a single update by moving the 0, 1, 2 block between 6 and 10:

[None, None, None, None, None, 5, 6, 0, 1, 2, 10]

In reality there will be thousands of blocks, but I know beforehand the size of each block and of each gap. Moving blocks is also very expensive compared to the computation needed to work out the combinations of block sizes and gaps, so spending computation to find the optimal solution is worthwhile.
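
To make that concrete, the block and gap sizes can be read straight off the list with a run-length pass like the one below (a purely illustrative helper, not part of my current code):

def runs(cells):
    """Return (start, length) pairs for each block of numbers and each gap."""
    blocks, gaps = [], []
    i, n = 0, len(cells)
    while i < n:
        j = i
        while j < n and (cells[j] is None) == (cells[i] is None):
            j += 1
        (gaps if cells[i] is None else blocks).append((i, j - i))
        i = j
    return blocks, gaps

blocks, gaps = runs([0, 1, 2, None, None, 5, 6, None, None, None, 10])
# blocks -> [(0, 3), (5, 2), (10, 1)]
# gaps   -> [(3, 2), (7, 3)]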

This seems like a kind of bin packing problem, but I really don't know how to approach it to find the best solution. Any ideas?


Solution

  • For the unordered case, suppose that somebody tells you which spaces the final contiguous block should fill. Then one heuristic is to assume that if you move the largest blocks outside this region into it first, everything will fit and you won't have to break any blocks up. As suggested in the comment, you could run A* (or branch and bound) with this. Your first decision is then where the final contiguous region should be, but that is just another level of A*/branch and bound. In fact, under this heuristic the most promising final contiguous region is the one that already holds the largest number of filled-in sub-regions, because you are assuming you only have to move in the sub-regions outside it (a sketch of how candidate regions might be scored appears at the end of this answer).

    If you do find this is too expensive, one way to speed up branch and bound, at the cost of getting poorer answers, is to discard possible answers that could improve the best answer found so far by only X%, for some X.

    One easy way to get a lower bound is to ignore enough constraints on the problem to make it easy. Assuming the unknown correct answer is still a feasible solution of the weakened problem, this must give you a lower bound, because the best solution of the weakened problem is at least as good as the unknown correct answer. You can apply this to your problem of gappy updates by pretending that two updates will never collide with each other. Given a specified target area, computing this heuristic amounts to finding the best way to cut the source area up into chunks, each of which fits into the target area. You could solve this with a dynamic program: work out the best answer for the first n+1 cells of the source area by considering every possible way of moving in the last k of those cells, and adding on the cost of moving in the first n+1-k cells, which you will already have worked out. Unfortunately, I have no idea whether this heuristic is strong enough to be useful.

    Actually, I think you can get a slightly better lower bound than this: max(number of separate contiguous gaps in the target area, number of separate contiguous areas to be moved in from the source area), because one update can at best move in a single contiguous area of numbers and fill a single gap in the target area. This is the bound computed in the sketch below.
    One easy way to get a lower bound is to ignore enough constraints on the problem to make it easy. Assuming that the unknown correct answer is still a feasible solution this must give you a lower bound, because a best solution on the weakened problem must be at least as good as the unknown correct answer. You can apply this to your problem with gappy updates by pretending that two updates will never collide with each other. Given a specified target area, computing this heuristic amounts to finding a best way to cut up the source area into chunks, each of which fit into the target area. You could solve this with a dynamic program: you work out best answer for the first n+1 cells of the source area by considering all possible ways of copying in the last k cells of the source area, and then adding on the cost of copying in the first n+1-k cells of the source area, which you will have already worked out. Unfortunately, I have no idea of whether this heuristic is strong enough to be useful.