I need to create a dataframe with all the possible unique combinations of n number of original arrays given a couple constraints. I want to be able to do this without filtering down the initial dataframe due to memory constraints.
There will be two types of original input arrays. Either they will have boolean values of just True and False, or they will have a variable amount of float values.
The additional tricky layer is that any row should only have one non zero float value, the other float values must be a 0.
inputs = {
"a": [True, False],
"b": [True, False],
"c": [0.0, 0.1, 0.2],
"d": [0.0, 0.1, 0.2, 0.3],
}
bool_inputs = {"a", "b"}
float_inputs = {"c", "d"}
a b c d
0 True True 0.0 0.0
1 True False 0.0 0.0
2 False True 0.0 0.0
3 False False 0.0 0.0
4 True True 0.1 0.0
5 True False 0.1 0.0
6 False True 0.1 0.0
7 False False 0.1 0.0
8 True True 0.2 0.0
9 True False 0.2 0.0
10 False True 0.2 0.0
11 False False 0.2 0.0
12 True True 0.0 0.1
13 True False 0.0 0.1
14 False True 0.0 0.1
15 False False 0.0 0.1
16 True True 0.0 0.2
17 True False 0.0 0.2
18 False True 0.0 0.2
19 False False 0.0 0.2
20 True True 0.0 0.3
21 True False 0.0 0.3
22 False True 0.0 0.3
23 False False 0.0 0.3
I have been able to do this by filtering the data afterwards with the below solution, but I am wanting to not have any filtering. A bonus would also be not having to fix the column types
import numpy as np
import pandas as pd
input_arrays = list(inputs.values())
results = np.array(np.meshgrid(*input_arrays)).T.reshape(-1, len(inputs))
df = pd.DataFrame(results, columns=list(inputs.keys()))
df[list(bool_inputs)] = df[list(bool_inputs)].astype(bool)
df = df[~(df[list(float_inputs)] > 0).all(axis=1)]
df = df.reset_index(drop=True)
Key to not having to filter the float columns is making a block diagonal matrix. Everything else here is just .join(..., how = 'cross')
from scipy.linalg import block_diag
import pandas as pd
import numpy as np
inputs = {
"a": [True, False],
"b": [True, False],
"c": [0.0, 0.1, 0.2],
"d": [0.0, 0.1, 0.2, 0.3],
}
bool_inputs = {"a", "b"}
float_inputs = {"c", "d"}
num = block_diag(*[np.array(inputs[k])[None, np.flatnonzero(inputs[k])] for k in float_inputs]).T
df = pd.DataFrame(columns = float_inputs, data = num)
for k in bool_inputs:
df = df.join(pd.DataFrame(columns = [k], data = inputs[k]), how = 'cross')
df = df.reindex(sorted(df.columns), axis=1)
Output:
a b c d
0 True True 0.1 0.0
1 False True 0.1 0.0
2 True False 0.1 0.0
3 False False 0.1 0.0
4 True True 0.2 0.0
5 False True 0.2 0.0
6 True False 0.2 0.0
7 False False 0.2 0.0
8 True True 0.0 0.1
9 False True 0.0 0.1
10 True False 0.0 0.1
11 False False 0.0 0.1
12 True True 0.0 0.2
13 False True 0.0 0.2
14 True False 0.0 0.2
15 False False 0.0 0.2
16 True True 0.0 0.3
17 False True 0.0 0.3
18 True False 0.0 0.3
19 False False 0.0 0.3
EDIT: I assumed that "Only one non-zero float column" was strict, otherwise replace:
num = block_diag(*[np.array(inputs[k])[None, np.flatnonzero(inputs[k])] for k in float_inputs]).T
df = pd.DataFrame(columns = float_inputs, data = num)
with:
num = block_diag(*np.array(inputs[k])[None, np.flatnonzero(inputs[k])] for k in float_inputs]).T
num = np.r_[np.zeros_like(num)[[0], :], num]
df = pd.DataFrame(columns = float_inputs, data = num)