I have a pandas dataframe e.g.
import pandas as pd

data = {'col_1': ['a', 'b'], 'col_2': ['b', 'c']}
df = pd.DataFrame(data)
I want to convert this to a sparse representation of the data in numpy e.g.
[[[1, 0], [0, 0]], [[0, 1], [1, 0]], [[0, 0], [0, 1]]]
where each 2x2 matrix represents the occurrences of 'a', 'b' and 'c' in my pandas dataframe.
I can achieve the desired outcome through some messy operations:
boolean_matrix = pd.get_dummies(df, prefix='', prefix_sep='').groupby(level=0, axis=1).sum()
boolean_matrix = boolean_matrix.values.tolist()
boolean_matrix = [[[int(i == j) for j in range(len(boolean_matrix[0]))] for i in row] for row in boolean_matrix]
print(boolean_matrix)
But I can't believe this is the standard way to do what is probably a pretty common operation. Are there any built-in methods (pandas, polars, numpy, tensorflow) that will do this?
Let's use numpy broadcasting and unique:
import numpy as np

out = (df.to_numpy() == np.unique(df)[:,None,None]).astype(int)
Or, for a specific order:
out = (df.to_numpy() == np.array(['a', 'b', 'c'])[:,None,None]).astype(int)
Output:
array([[[1, 0],
        [0, 0]],

       [[0, 1],
        [1, 0]],

       [[0, 0],
        [0, 1]]])
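To make the broadcasting step easier to follow, here is a minimal self-contained sketch that wraps both variants in one helper and spells out the shapes involved. The function name `value_planes` and its `categories` parameter are hypothetical, introduced only for illustration; they are not part of pandas or numpy.

```python
import numpy as np
import pandas as pd


def value_planes(df, categories=None):
    """Return (categories, planes), where planes[k] is a 0/1 mask of df == categories[k].

    `value_planes` is a hypothetical helper name used for illustration only.
    """
    values = df.to_numpy()  # shape (n_rows, n_cols)
    if categories is None:
        # Sorted distinct values across the whole frame, shape (k,)
        categories = np.unique(values)
    else:
        # Caller-supplied order, e.g. ['a', 'b', 'c']
        categories = np.asarray(categories)
    # categories[:, None, None] has shape (k, 1, 1), so the comparison
    # broadcasts against (n_rows, n_cols) to give shape (k, n_rows, n_cols).
    planes = (values == categories[:, None, None]).astype(int)
    return categories, planes


df = pd.DataFrame({'col_1': ['a', 'b'], 'col_2': ['b', 'c']})
categories, planes = value_planes(df)
print(planes.shape)  # (3, 2, 2)
print(planes)
```

Returning the categories alongside the array keeps track of which plane belongs to which value, which matters once the order comes from `np.unique` rather than from an explicit list.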