
Distributed Powerset


Considering the powerset operation (generating all possible subsets of a given set) and its cost (time complexity of O(n*2^n)), I'm trying to scale it horizontally (a distributed solution). I don't know whether this is easily achievable (hence the question), but I'll try to break the problem down and make it as clear as possible.

Consider the following example using python:

import itertools

s = [1, 2, 3, 4, 5]

for l in range(1, len(s) + 1):   # this outer loop can be distributed
    for subset in itertools.combinations(s, l):
        print(subset)

It's possible (and easy) to distribute the workload based on subset length. For instance, for a set of length 5, we can make each worker compute all subsets of one length N - in this case we would have 5 workers. Why this doesn't appeal to me is quite obvious: the workload distribution is not balanced at all. A set of length 20 generates 184756 subsets of length 10 but only 20 subsets of length 1, which means the middle workers will always have far more processing to do.
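The imbalance is easy to quantify with the binomial coefficients; this small sketch (variable names are mine) compares the smallest and largest per-length workloads for n = 20:

```python
import math

# Number of subsets of each length l for a 20-element set.
# A per-length split assigns C(n, l) subsets to worker l.
n = 20
counts = [math.comb(n, l) for l in range(1, n + 1)]

print(counts[0])    # worker for length 1 gets 20 subsets
print(max(counts))  # worker for length 10 gets 184756 subsets
```

The middle worker gets roughly 9000 times more work than the first one, which is exactly the imbalance described above.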

Question

Is there a way to distribute the workload evenly in this case, and how? Rephrasing the problem: for a set of length L, can I distribute the work of computing the powerset across N well-balanced workers?


Solution

  • If you use n bits of an integer to represent the items in a subset of n items, you can start the variable at 0, and increment it to get to the next subset. So to evenly distribute the work between k processors, you can simply have processor #i start its integer variable at i and add k to it on each step. Every subset will be processed by exactly one processor.

    Bear in mind that this won't do very much to help you solve large problems. If you can solve a problem of size x on a single computer (and I'd estimate 20 <= x <= 30 on today's computers, roughly), then even by buying 1024 computers you would only be able to solve a problem of size x+10.
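The bitmask-striding scheme described above can be sketched as follows (the function and variable names are illustrative, not from the answer): worker i of k enumerates the integers i, i+k, i+2k, ... below 2^n and decodes each into a subset.

```python
def worker_subsets(items, i, k):
    """Yield the subsets of `items` whose bitmask is congruent to i mod k.

    Each integer mask in [0, 2**n) encodes one subset: bit b set means
    items[b] is included. Worker i starts at i and strides by k, so the
    k workers partition the powerset with each worker getting either
    floor(2**n / k) or ceil(2**n / k) subsets.
    """
    n = len(items)
    for mask in range(i, 2 ** n, k):
        yield tuple(x for b, x in enumerate(items) if mask >> b & 1)

# Example: split the powerset of a 5-element set across 4 workers.
s = [1, 2, 3, 4, 5]
all_subsets = [sub for i in range(4) for sub in worker_subsets(s, i, 4)]

assert len(all_subsets) == 2 ** len(s)       # 32 subsets produced in total
assert len(set(all_subsets)) == len(all_subsets)  # each exactly once
```

Because consecutive integers are dealt out round-robin, worker loads differ by at most one subset, which is the balance the question asks for.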