
Binning pandas/numpy array in unequal sizes with approx equal computational cost


I have a problem where data must be processed across multiple cores. Let groups be a Pandas GroupBy aggregation (e.g. the result of size() or sum()); each value represents the computational "cost" that group incurs on a core. How can I divide groups into n bins of unequal sizes but with (approximately) equal computational cost?

import pandas as pd
import numpy as np
size = 50
rng = np.random.default_rng(2021)
df = pd.DataFrame({
    "one": np.linspace(0, 10, size, dtype=np.uint8),
    "two": np.linspace(0, 5, size, dtype=np.uint8),
    "data": rng.integers(0, 100, size)
})
groups = df.groupby(["one", "two"]).sum()
df
    one  two  data
0     0    0    75
1     0    0    75
2     0    0    49
3     0    0    94
4     0    0    66
...
45    9    4    12
46    9    4    97
47    9    4    12
48    9    4    32
49   10    5    45

People typically split the dataset into n bins, as in the code below. However, splitting the dataset into n equal parts is undesirable because the cores receive very unbalanced workloads, e.g. 205 vs. 788.

n = 4
bins = np.array_split(groups, n) # undesired
[b.sum() for b in bins]  #undesired
[data    788
dtype: int64, data    558
dtype: int64, data    768
dtype: int64, data    205
dtype: int64]

A desired solution splits the data into bins of unequal sizes but with approximately equal summed values. I.e. the difference abs(743 - 548) = 195 is smaller than with the previous method, abs(205 - 788) = 583, and the difference should be as small as possible. A simple list example of the desired result:

# only an example to demonstrate the desired functionality:
# each bin is a list of ([group key], cost) pairs
example = [
    [[[10, 5], 45], [[2, 1], 187], [[3, 1], 249], [[6, 3], 262]],
    [[[9, 4], 153], [[4, 2], 248], [[1, 0], 264]],
    [[[8, 4], 245], [[7, 3], 326]],
    [[[5, 2], 189], [[0, 0], 359]],
]

[sum(size for (group, size) in test) for test in example]  # [743, 665, 571, 548]

Is there a more efficient method to split the dataset into bins as described above in pandas or numpy?

It is important to split/bin the GroupBy object so that the data can be accessed in a way similar to what np.array_split() returns.


Solution

  • I think a good approach has been found. Credit goes to a colleague.

    The idea is to sort the group sizes (in descending order) and put groups into bins in a "backward S"-pattern. Let me illustrate with an example. Assume n = 3 (number of bins) and the following data:

    groups
        data
    0    359
    1    326
    2    264
    3    262
    4    249
    5    248
    6    245
    7    189
    8    187
    9    153
    10    45
    

    The idea is to place one group at a time, moving "left to right" and then "right to left" (and so on) across the bins in a "backward S" pattern: the first element goes in bin 0, the second in bin 1, etc. On reaching the last bin, go backwards: the fourth element in bin 2, the fifth in bin 1, and so on. Below, the elements are shown in the bins they land in, with the group number in parentheses; the values are the group sizes.

     Bins:  |    0    |    1    |    2    |
            |  359 (0)|  326 (1)|  264 (2)|  
            |  248 (5)|  249 (4)|  262 (3)|
            |  245 (6)|  189 (7)|  187 (8)|
            |         |   45(10)|  153 (9)|
    

    The bins will have approximately the same summed value and, thus, approximately the same computational "cost". The bin sizes here are [852, 809, 866], for anyone interested. I have tried this on a real-world dataset and the bins came out similarly sized, but similarly sized bins are not guaranteed for every dataset.
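    The index order of this "backward S" pattern can be sanity-checked in isolation. A minimal sketch (the serpentine helper below is an illustration, not part of the original code):

```python
from itertools import chain, cycle

def serpentine(n, length):
    # Cycle through bin indices forward (0..n-1) then backward (n-1..0),
    # repeating the edge index each time the direction flips -- exactly
    # the "backward S" pattern described above.
    order = cycle(chain(range(n), reversed(range(n))))
    return [next(order) for _ in range(length)]

print(serpentine(3, 11))  # [0, 1, 2, 2, 1, 0, 0, 1, 2, 2, 1]
```

    For n = 3 and 11 groups this reproduces the bin assignments in the table above: bin 0 gets groups 0, 5, 6; bin 1 gets 1, 4, 7, 10; bin 2 gets 2, 3, 8, 9.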

    The code can be made more efficient, but this is sufficient to get the idea out:

    import numpy as np
    import pandas as pd

    n = 3
    size = 50
    rng = np.random.default_rng(2021)
    df = pd.DataFrame({
        "one": np.linspace(0, 10, size, dtype=np.uint8),
        "two": np.linspace(0, 5, size, dtype=np.uint8),
        "data": rng.integers(0, 100, size)
    })

    groups = df.groupby(["one", "two"]).sum()
    groups = groups.sort_values("data", ascending=False).reset_index(drop=True)

    bins = [[] for _ in range(n)]
    backward = False
    i = 0
    for group in groups.iterrows():
        bins[i].append(group)
        i = i + 1 if not backward else i - 1
        if i == n:                 # hit the right edge: turn around
            backward = True
            i -= 1
        if i == -1 and backward:   # hit the left edge: turn around
            backward = False
            i += 1

    [sum(row["data"] for (_, row) in b) for b in bins]  # [852, 809, 866]
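    An alternative worth comparing is the classic "longest processing time first" (LPT) greedy heuristic from multiprocessor scheduling: sort the costs in descending order and always assign the next group to the bin with the smallest running total. LPT has a known (4/3 - 1/(3n)) approximation guarantee on the largest bin, whereas the serpentine pattern has no such guarantee; on this particular dataset, though, the serpentine pattern happens to balance slightly better. A sketch (lpt_bins is an assumed name, not part of the original code):

```python
import heapq

def lpt_bins(sizes, n):
    # sizes: iterable of (key, cost) pairs.
    # Greedy LPT: largest cost first, always into the currently lightest bin.
    heap = [(0, i) for i in range(n)]      # (running total, bin index)
    heapq.heapify(heap)
    bins = [[] for _ in range(n)]
    for key, cost in sorted(sizes, key=lambda kv: kv[1], reverse=True):
        total, i = heapq.heappop(heap)     # lightest bin so far
        bins[i].append((key, cost))
        heapq.heappush(heap, (total + cost, i))
    return bins

# The group sizes from the example above
sizes = [(0, 359), (1, 326), (2, 264), (3, 262), (4, 249), (5, 248),
         (6, 245), (7, 189), (8, 187), (9, 153), (10, 45)]
parts = lpt_bins(sizes, 3)
print([sum(cost for _, cost in b) for b in parts])  # [794, 917, 816]
```

    Using a heap makes each assignment O(log n), so the whole pass is O(g log g + g log n) for g groups, dominated by the sort.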