Search code examples
pythonpandasnumpy

Create dataframe with all unique combinations given a set of constraints


I need to create a dataframe with all the possible unique combinations of n number of original arrays given a couple constraints. I want to be able to do this without filtering down the initial dataframe due to memory constraints.

There will be two types of original input arrays. Either they will have boolean values of just True and False, or they will have a variable amount of float values.

The additional tricky layer is that any row should only have one non zero float value, the other float values must be a 0.

Example input:

inputs = {
    "a": [True, False],
    "b": [True, False],
    "c": [0.0, 0.1, 0.2],
    "d": [0.0, 0.1, 0.2, 0.3],
}

bool_inputs = {"a", "b"}
float_inputs = {"c", "d"}

Example Output:

        a      b    c    d
0    True   True  0.0  0.0
1    True  False  0.0  0.0
2   False   True  0.0  0.0
3   False  False  0.0  0.0
4    True   True  0.1  0.0
5    True  False  0.1  0.0
6   False   True  0.1  0.0
7   False  False  0.1  0.0
8    True   True  0.2  0.0
9    True  False  0.2  0.0
10  False   True  0.2  0.0
11  False  False  0.2  0.0
12   True   True  0.0  0.1
13   True  False  0.0  0.1
14  False   True  0.0  0.1
15  False  False  0.0  0.1
16   True   True  0.0  0.2
17   True  False  0.0  0.2
18  False   True  0.0  0.2
19  False  False  0.0  0.2
20   True   True  0.0  0.3
21   True  False  0.0  0.3
22  False   True  0.0  0.3
23  False  False  0.0  0.3

I have been able to do this by filtering the data afterwards with the below solution, but I am wanting to not have any filtering. A bonus would also be not having to fix the column types

import numpy as np
import pandas as pd


input_arrays = list(inputs.values())
results = np.array(np.meshgrid(*input_arrays)).T.reshape(-1, len(inputs))

df = pd.DataFrame(results, columns=list(inputs.keys()))

df[list(bool_inputs)] = df[list(bool_inputs)].astype(bool)
df = df[~(df[list(float_inputs)] > 0).all(axis=1)]
df = df.reset_index(drop=True)

Solution

  • Key to not having to filter the float columns is making a block diagonal matrix. Everything else here is just .join(..., how = 'cross')

    from scipy.linalg import block_diag
    import pandas as pd
    import numpy as np
    
    inputs = {
        "a": [True, False],
        "b": [True, False],
        "c": [0.0, 0.1, 0.2],
        "d": [0.0, 0.1, 0.2, 0.3],
    }
    
    bool_inputs = {"a", "b"}
    float_inputs = {"c", "d"}
    
    num = block_diag(*[np.array(inputs[k])[None, np.flatnonzero(inputs[k])] for k in float_inputs]).T
    df = pd.DataFrame(columns = float_inputs, data = num)
    
    
    for k in bool_inputs:
        df = df.join(pd.DataFrame(columns = [k], data = inputs[k]), how = 'cross')
        
    df = df.reindex(sorted(df.columns), axis=1)
    

    Output:

            a      b    c    d
    0    True   True  0.1  0.0
    1   False   True  0.1  0.0
    2    True  False  0.1  0.0
    3   False  False  0.1  0.0
    4    True   True  0.2  0.0
    5   False   True  0.2  0.0
    6    True  False  0.2  0.0
    7   False  False  0.2  0.0
    8    True   True  0.0  0.1
    9   False   True  0.0  0.1
    10   True  False  0.0  0.1
    11  False  False  0.0  0.1
    12   True   True  0.0  0.2
    13  False   True  0.0  0.2
    14   True  False  0.0  0.2
    15  False  False  0.0  0.2
    16   True   True  0.0  0.3
    17  False   True  0.0  0.3
    18   True  False  0.0  0.3
    19  False  False  0.0  0.3
    

    EDIT: I assumed that "Only one non-zero float column" was strict, otherwise replace:

    num = block_diag(*[np.array(inputs[k])[None, np.flatnonzero(inputs[k])] for k in float_inputs]).T
    df = pd.DataFrame(columns = float_inputs, data = num)
    

    with:

    num = block_diag(*np.array(inputs[k])[None, np.flatnonzero(inputs[k])] for k in float_inputs]).T
    num = np.r_[np.zeros_like(num)[[0], :], num]
    df = pd.DataFrame(columns = float_inputs, data = num)