Tags: python, set, combinatorics

How to create N subsets with x0, ..., xn elements, without replacement, so that each subset has the mean of the superset?


I'm looking for advice on what to google, and for efficient ways to sample, without replacement, a superset containing M elements into N subsets with x0, ..., xn elements, where the subset sizes sum to M (sum(x0, ..., xn) == M), so that the mean of every subset is as close to the superset's mean as possible. The elements are real numbers (floats; in most cases positive floats like 100.24).

Example 1 (perfect allocation):

Superset contains:

- "100" 5 times
- "101" 10 times
- "102" 5 times
mean = 101
Superset = {100, 100, 100, 100, 100, 101, 101, 101, 101, 101, 101, 101, 101, 101, 101, 102, 102, 102, 102, 102}

Create 3 subsets containing: 2, 4, 14 elements:

A = {100, 102}
B = {100, 102, 100, 102}
C = {100, 100, 101, 101, 101, 101, 101, 101, 101, 101, 101, 101, 102, 102}

Every subset has a mean of 101.

Example 2 (best fit - perfect not possible):

Superset contains:

- "100": 5 times
- "103": 10 times
- "104": 5 times
mean = 102.5
Superset = {100, 100, 100, 100, 100, 103, 103, 103, 103, 103, 103, 103, 103, 103, 103, 104, 104, 104, 104, 104}

Create 3 subsets containing: 2, 4, 14 elements:

A = {100, 104}, mean = 102
B = {100, 103, 103, 104}, mean = 102.5
C = {100, 100, 100, 103, 103, 103, 103, 103, 103, 103, 103, 104, 104, 104}, mean = 102.5714 

OR 

A = {100, 104}, mean = 102
B = {100, 103, 104, 104}, mean = 102.75
C = {100, 100, 100, 103, 103, 103, 103, 103, 103, 103, 103, 103, 104, 104}, mean = 102.5

I think the most "fair" way is to define the error function as the sum of the subsets' absolute divergences from the superset's mean and to minimize that error. For example 2, the first solution's error is 0.5 + 0 + 0.0714 ≈ 0.571, while the second's is 0.5 + 0.25 + 0 = 0.75, so the first solution is better than the second.
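
To make that concrete, here is a minimal sketch of such an error function (the helper name partition_error is just for illustration), applied to the two candidate splits from example 2:

    from statistics import mean

    def partition_error(superset, subsets):
        """Sum of the subsets' absolute divergences from the superset's mean."""
        target = mean(superset)
        return sum(abs(mean(s) - target) for s in subsets)

    superset = [100] * 5 + [103] * 10 + [104] * 5
    option_1 = [[100, 104], [100, 103, 103, 104], [100] * 3 + [103] * 8 + [104] * 3]
    option_2 = [[100, 104], [100, 103, 104, 104], [100] * 3 + [103] * 9 + [104] * 2]

    print(partition_error(superset, option_1))  # ~0.571 -> smaller error, "fairer" split
    print(partition_error(superset, option_2))  # 0.75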

In general, I'm looking for which field of math/CS covers such problems, so that I can read up on existing solutions, as well as any advice on how best to do it / how you would do it.

In my case there is also a "rough" constraint that, regardless of input data, the algorithm should run in under 1 second on an "everyday machine". Regarding input data: in the majority of cases the number of subsets will be 10-25, with an upper limit of 100. The number of elements in the superset has no upper limit, but in most cases there will be no more than 100-1000 unique elements. Let's say the upper limit for unique elements is 10,000, but that is an edge case.

So far I have come to these conclusions: depending on how many unique numbers there are in the superset and how many subsets there are, the problem is more or less computationally demanding due to the number of possible combinations.
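
As a rough illustration of that growth (treating all elements as distinct; count_splits is just an illustrative helper): the number of ways to deal M elements into labelled subsets of sizes x0, ..., xn is the multinomial coefficient M! / (x0! · ... · xn!), which is already large for the small examples above:

    from math import comb

    def count_splits(sizes):
        """Number of ways to deal sum(sizes) distinct elements into labelled subsets of the given sizes."""
        ways, remaining = 1, sum(sizes)
        for s in sizes:
            ways *= comb(remaining, s)
            remaining -= s
        return ways

    print(count_splits([2, 4, 14]))       # 581,400 for the 20-element examples
    print(count_splits([200, 250, 350]))  # astronomically large

With many repeated values most of these splits are equivalent, which is why the number of unique values matters so much.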

For this reason I would employ different algorithms for different inputs:

  1. Computationally light - find all possible combinations and choose the best fit.
  2. Computationally mediocre - Starting from the smallest subset, allocate as well as possible, set by set (see the sketch after this list). This is not "fair", but it will result in a small overall error, since subsets with a large number of elements will always be 'relatively' close to the superset's mean even with randomly chosen elements.
  3. Computationally demanding - Pre-allocate the subsets to 50-75% of their sizes by sampling the superset uniformly (this keeps every subset close to the original mean) and then employ algorithm 2) to get the subsets as close to the original mean as possible. The pre-allocation size would depend on the ratio of the number of subsets to the number of unique values and elements in the superset.
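
A minimal sketch of strategy 2, under my reading of it (greedy_allocate and the "pick the element that keeps the running mean closest to the target" heuristic are illustrative assumptions, not an established algorithm):

    from statistics import mean

    def greedy_allocate(superset, sizes):
        """Fill subsets smallest-first; for each slot pick the remaining element
        that keeps the subset's running mean closest to the superset's mean."""
        target = mean(superset)
        pool = list(superset)                        # remaining, unallocated elements
        subsets = {i: [] for i in range(len(sizes))}
        for i, size in sorted(enumerate(sizes), key=lambda pair: pair[1]):  # smallest subset first
            for _ in range(size):
                current_sum = sum(subsets[i])
                current_len = len(subsets[i])
                # index of the element that brings the new running mean closest to the target
                best = min(
                    range(len(pool)),
                    key=lambda j: abs((current_sum + pool[j]) / (current_len + 1) - target),
                )
                subsets[i].append(pool.pop(best))
        return [subsets[i] for i in range(len(sizes))]

    superset = [100] * 5 + [103] * 10 + [104] * 5
    for part in greedy_allocate(superset, [2, 4, 14]):
        print(part, mean(part))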

I haven't found a good way to reduce a big problem into a smaller one and then expand the solution back to the original problem. I have thought about doing the following:

Superset contains:
- "100": 100 times
- "103": 200 times
- "104": 500 times
Create 3 subsets containing: 200, 250, 350 elements

Reduce (50x) to:

Superset contains:
- "100": 2 times
- "103": 4 times
- "104": 10 times
Create 3 subsets containing: 4, 5, 7 elements.

But the obtainable results are worse than when solving the original case, and I think uniform pre-allocation will yield better results than this "size reduction" method.
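
For reference, a minimal sketch of that reduction step, assuming the reduction factor is simply the gcd of all value counts and subset sizes (reduce_problem is an illustrative name):

    from functools import reduce
    from math import gcd

    def reduce_problem(counts, sizes):
        """Divide all value counts and subset sizes by their greatest common divisor."""
        factor = reduce(gcd, list(counts.values()) + list(sizes))
        reduced_counts = {value: c // factor for value, c in counts.items()}
        reduced_sizes = [s // factor for s in sizes]
        return reduced_counts, reduced_sizes, factor

    counts = {100: 100, 103: 200, 104: 500}   # value -> number of occurrences
    sizes = [200, 250, 350]                   # requested subset sizes
    print(reduce_problem(counts, sizes))      # ({100: 2, 103: 4, 104: 10}, [4, 5, 7], 50)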

Any advice is appreciated. I will code this in Python (NumPy). Please add any tags that are appropriate.


Solution

  • IIUC, you should google for the set partitioning problem. Here is an example solution using PuLP:

    from statistics import mean
    
    import pulp
    
    superset = [100, 100, 100, 100, 100, 101, 101, 101, 101, 101, 101, 101, 101, 101, 101, 102, 102, 102, 102, 102]
    
    target_mean = sum(superset) / len(superset)  # every subset's mean should be close to this
    set_sizes = [2, 4, 14]
    N = len(set_sizes)
    
    assert sum(set_sizes) == len(superset)
    
    covering = {}
    for s in range(N):
        vals = []
        for i, v in enumerate(superset):
            vals.append(
                pulp.LpVariable(
                    f"covering_set_{s}_value_idx_{i:>02}_val_{v}",
                    lowBound=0.0,
                    upBound=1.0,
                    cat=pulp.LpInteger,
                )
            )
        covering[s] = vals
    
    set_partitioning_model = pulp.LpProblem("Set_Partitioning_Model", pulp.LpMinimize)
    
    abs_sum_errs = []
    for s_i in range(N):
        set_sum_err_abs = pulp.LpVariable(f"set_{s_i}_sum_error_abs")
        abs_sum_errs.append(set_sum_err_abs)
    
    # OBJECTIVE: minimize sum of absolute errors
    set_partitioning_model += pulp.lpSum(abs_sum_errs)
    
    for s_i, st in enumerate(covering.values()):
        set_sum_err = pulp.LpVariable(f"set_{s_i}_sum_error")
        # this subset's sum should be close to its target: subset size * superset mean
        set_partitioning_model += set_sum_err == (
            set_sizes[s_i] * target_mean
            - pulp.lpSum([p * superset[i] for i, p in enumerate(st)])
        )
        # ABS VALUE CONSTRAINTS
        # https://stackoverflow.com/questions/59233699/mathematical-operation-of-abs-in-objective-function-of-pulp/59566067#59566067
        set_partitioning_model += abs_sum_errs[s_i] >= set_sum_err
        set_partitioning_model += abs_sum_errs[s_i] >= -set_sum_err
    
    # set sizes are predetermined
    for n, st in zip(set_sizes, covering.values()):
        set_partitioning_model += pulp.lpSum(st) == n
    
    # every element of the superset must be assigned to exactly one subset:
    for s_i, st in enumerate(zip(*covering.values())):
        set_partitioning_model += (
            pulp.lpSum(st) == 1,
            f"value_{s_i}_used_exactly_once",
        )
    
    set_partitioning_model.solve()
    
    # print every subset together with its mean
    for v in covering.values():
        lst = [superset[idx] for idx, var in enumerate(v) if var.value() == 1]
        print(lst, mean(lst))
    

    The solution:

    [101, 101] 101
    [100, 100, 102, 102] 101
    [100, 100, 100, 101, 101, 101, 101, 101, 101, 101, 101, 102, 102, 102] 101
    

    For example 2, change the superset to:

    superset = [100, 100, 100, 100, 100, 103, 103, 103, 103, 103, 103, 103, 103, 103, 103, 104, 104, 104, 104, 104]
    

    The solution is:

    [103, 103] 103
    [100, 100, 104, 104] 102
    [100, 100, 100, 103, 103, 103, 103, 103, 103, 103, 103, 104, 104, 104] 102.57142857142857
    

    Karmarkar-Karp algorithm

    Note: just for info; I'm not sure how you can specify the set sizes with this approach.

    For heuristics, you can google for the Karmarkar-Karp algorithm. Here is an example using the numberpartitioning module.

    from statistics import mean
    
    from numberpartitioning import karmarkar_karp
    
    
    superset = [100, 100, 100, 100, 100, 103, 103, 103, 103, 103, 103, 103, 103, 103, 103, 104, 104, 104, 104, 104]
    
    print("superset mean", mean(superset))
    for p in karmarkar_karp(superset, num_parts=3).partition:
        print(p, mean(p))
    

    Prints:

    superset mean 102.5
    [104, 104, 103, 103, 103, 100] 102.83333333333333
    [100, 103, 104, 103, 103, 103, 100] 102.28571428571429
    [100, 104, 104, 103, 103, 103, 100] 102.42857142857143