pythonsetcombinatorics# How to create N subsets with x0,..,xn elements, without replacement, so that each subset has the mean of superset?

### Karmarkar-Karp algorithm

I'm looking for advice what to google/efficient ways how to sample without replacement a superset containing M elements into N subsets with x0,...,xn elements, where the sum of all elements of subsets equals M (sum(x0,...,xn) == M), so that mean of every subset is as close to the superset's mean as possible. The elements are real numbers (floats; positive floats like 100.24 in most cases).

Example 1 (perfect allocation):

```
Superset contains:
- "100" 5 times
- "101" 10 times
- "102" 5 times
mean = 101
Superset = {100, 100, 100, 100, 100, 101, 101, 101, 101, 101, 101, 101, 101, 101, 101, 102, 102, 102, 102, 102}
Create 3 subsets containing: 2, 4, 14 elements:
A = {100, 102}
B = {100, 102, 100, 102}
C = {100, 100, 101, 101, 101, 101, 101, 101, 101, 101, 101, 101, 102, 102}
Every subset has a mean of 101.
```

Example 2 (best fit - perfect not possible):

```
Superset contains:
- "100": 5 times
- "103": 10 times
- "104": 5 times
mean = 102.5
Superset = {100, 100, 100, 100, 100, 103, 103, 103, 103, 103, 103, 103, 103, 103, 103, 104, 104, 104, 104, 104}
Create 3 subsets containing: 2, 4, 14 elements:
A = {100, 104}, mean = 102
B = {100, 103, 103, 104}, mean = 102.5
C = {100, 100, 100, 103, 103, 103, 103, 103, 103, 103, 103, 104, 104, 104}, mean = 102.5714
OR
A = {100, 104}, mean = 102
B = {100, 103, 104, 104}, mean = 102.75
C = {100, 100, 100, 103, 103, 103, 103, 103, 103, 103, 104, 104, 104, 104}, mean = 102.5
```

I think the most "fair" way is to describe the error function as the sum of subsets' divergences from the superset's mean and to optimize for minimization of such error. Therefore for example 2, 1st solution is better than 2nd.

I'm looking in general what field in math/CS covers such problems so that I can read on existing solutions as well as any advice on how to best do it/how you would do it.

In my case there is also a "rough" constraint that regardless of input data, the algo should run under 1 second on a "everyday machine". In regards to input data - in majority of cases the number of subsets will be 10-25, upper limit is 100. The amount of elements in superset has no upper limit, however in most cases there will be no more than 100-1000 unique elements. Let's say upper limit for unique elements is 10000, but this is an edge case.

So far I got to these conclusions: Depending on how many unique numbers there are in superset and how many subsets there are, the case is more or less computationally demanding due to the amount of possible combinations.

For this reason I would employ different algorithms for different sets:

- Computationally light - find all possible combinations and choose best fit.
- Computationally medicore - Starting from the smallest subset, allocate as good as possible, set by set - This is not "fair", but will result in small general error, as subsets with big amount of elements will always be 'relatively' close to the superset's mean even with random numbers.
- Computationally Demanding - Pre-allocate the subsets to 50-75% of their sizes by sampling the superset uniformly (this will make every subset close to original mean) and then employ algorithm 2) to get the subsets as close to original mean as possible. The pre-allocation size would depend on the ratio of number of subsets to the amount of unique numbers and elements in superset.

I haven't found a good way to reduce a big problem into smaller problem and then expand the solution back to original problem. I have though about doing the following:

```
Superset contains:
- "100": 100 times
- "103": 200 times
- "104": 500 times
Create 3 subsets containing: 200, 250, 350 elements
Reduce (50x) to:
Superset contains:
- "100": 2 times
- "103": 4 times
- "104": 10 times
Create 3 subsets containing: 4, 5, 7 elements.
```

But the obtainable results are worse then when solving for original case and I think uniform pre-allocation will yield better results then this "size reduction" method.

Any advice is appreciated. I will code this in Python(Numpy). Please add any tags that are appropriate.

Solution

IIUC, you should google for `set partitioning problem`

. Here is an example solution using Pulp (link):

```
from statistics import mean
import pulp
superset = [100, 100, 100, 100, 100, 101, 101, 101, 101, 101, 101, 101, 101, 101, 101, 102, 102, 102, 102, 102]
target_sum = sum(superset)
set_sizes = [2, 4, 14]
N = len(set_sizes)
assert sum(set_sizes) == len(superset)
covering = {}
for s in range(N):
vals = []
for i, v in enumerate(superset):
vals.append(
pulp.LpVariable(
f"covering_set_{s}_value_idx_{i:>02}_val_{v}",
lowBound=0.0,
upBound=1.0,
cat=pulp.LpInteger,
)
)
covering[s] = vals
set_partitioning_model = pulp.LpProblem("Set_Partitioning_Model", pulp.LpMinimize)
abs_sum_errs = []
for s_i, st in enumerate(covering.values()):
set_sum_err_abs = pulp.LpVariable(f"set_{s_i}_sum_error_abs")
abs_sum_errs.append(set_sum_err_abs)
# OBJECTIVE: minimize sum of absolute errors
set_partitioning_model += pulp.lpSum(abs_sum_errs)
for s_i, st in enumerate(covering.values()):
set_sum_err = pulp.LpVariable(f"set_{s_i}_sum_error")
set_partitioning_model += set_sum_err == (
target_sum - (pulp.lpSum([p * superset[i] for i, p in enumerate(st)]))
)
# ABS VALUE CONSTRAINTS
# https://stackoverflow.com/questions/59233699/mathematical-operation-of-abs-in-objective-function-of-pulp/59566067#59566067
set_partitioning_model += abs_sum_errs[s_i] >= set_sum_err
set_partitioning_model += abs_sum_errs[s_i] >= -set_sum_err
# set sizes are predetermined
for n, st in zip(set_sizes, covering.values()):
set_partitioning_model += pulp.lpSum(st) == n
# every value can be used only once in set:
for s_i, st in enumerate(zip(*covering.values())):
set_partitioning_model += (
pulp.lpSum(st) == 1,
f"value {s_i} must be used only once",
)
set_partitioning_model.solve()
means = []
for k, v in covering.items():
lst = [superset[idx] for idx, i in enumerate(v) if i.value() == 1]
print(lst, mean(lst))
```

The solution:

```
[101, 101] 101
[100, 100, 102, 102] 101
[100, 100, 100, 101, 101, 101, 101, 101, 101, 101, 101, 102, 102, 102] 101
```

```
superset = [100, 100, 100, 100, 100, 103, 103, 103, 103, 103, 103, 103, 103, 103, 103, 104, 104, 104, 104, 104]
```

The solution is:

```
[103, 103] 103
[100, 100, 104, 104] 102
[100, 100, 100, 103, 103, 103, 103, 103, 103, 103, 103, 104, 104, 104] 102.57142857142857
```

Note: *(Just for info here - I'm not sure how you can specify set sizes here)*

For heuristics, you can google for `Karmarkar-Karp algorithm`

. Here is an example using `numberpartitioning`

module.

```
from statistics import mean
from numberpartitioning import karmarkar_karp
superset = [100, 100, 100, 100, 100, 103, 103, 103, 103, 103, 103, 103, 103, 103, 103, 104, 104, 104, 104, 104]
print("superset mean", mean(superset))
for p in karmarkar_karp(superset, num_parts=3).partition:
print(p, mean(p))
```

Prints:

```
superset mean 102.5
[104, 104, 103, 103, 103, 100] 102.83333333333333
[100, 103, 104, 103, 103, 103, 100] 102.28571428571429
[100, 104, 104, 103, 103, 103, 100] 102.42857142857143
```

- Python Jinja2 LaTeX Table
- Getting attributes of a class
- How can I print many significant figures in Python?
- How to allow list append() method to return the new list
- Calculate Last Friday of Month in Pandas
- Python type hint for Iterable[str] that isn't str
- How to iterate over a list in chunks
- How to exit the entire application from a Python thread?
- Running shell command and capturing the output
- How do I pass a variable by reference?
- Convert range(r) to list of strings of length 2 in python
- How can I get the start and end dates for each week?
- how to use send_message() in python-telegram-bot
- Python conditional replacement based on element type
- How can I count the number of items in an arbitrary iterable (such as a generator)?
- Find longest consecutive range of numbers in list
- Insert text in braces with asyncpg
- How does one put a link / url to the web-site's home page in Django?
- How to determine if a path is a subdirectory of another?
- Custom Keybindings for Ipython terminal
- FastAPI asynchronous background tasks blocks other requests?
- How to make sure that information from one file is duplicated into several text documents, without specific lines
- Installing a Python environment with Anaconda
- sklearn pipeline model predicting same results for all input
- Brew command not found after installing Anaconda Python
- How to get an XPath from selenium webelement or from lxml?
- Pipe PuTTY console to Python script
- How to align the axes of a figure in matplotlib?
- Persist ParentDocumentRetriever of langchain
- How to reset index in a pandas dataframe?