I am working on a use case where I have many observations of many features per id and I need to count the frequency of values (~5 discrete values) for each id per column.
I have a solution for it which works for a fairly small dataset (e.g. < 1million rows) but fails on my whole dataset (which might get even larger in the future) since it fills my RAM. I could not find a clean groupby solution since I need to do it for many columns at once.
Sample data:
import pandas as pd
import numpy as np
n = 100 # Number of features
m = 100 # Number of classes
k = 10000 # number of occurrences per class
possible_values = [1, 2, 3, 4, 5]
df = []
for i in range(m):
for j in range(k):
df.append( np.append(i, np.random.choice(possible_values, n)) )
features = [f"feature{i}" for i in range(n)]
df = pd.DataFrame(df, columns=(["id"] + features))
It is easy enough without the groupby case:
df[features].apply(pd.value_counts).T / df.shape[0]
My approach
melted = df.melt(id_vars="id", var_name='feature', value_name='value')
feature_freq_id = pd.crosstab(index=[melted.id, melted.feature], columns=melted.value).reset_index()
feature_freq_id[possible_values] = feature_freq_id[possible_values].div(feature_freq_id[possible_values].sum(axis=1), axis=0)
The problem is that melted
has n*m*k
rows. My dataset has >250 features, >200 ids and ~5k observations per id which means that melted
will have >250 mil rows. This results in my memory being eventually filled and python dies.
Expected result:
feature_freq_id.head(3)
id | feature | 1 | 2 | 3 | 4 | 5 | |
---|---|---|---|---|---|---|---|
0 | 0 | feature0 | 0.183 | 0.185 | 0.226 | 0.187 | 0.219 |
1 | 0 | feature1 | 0.178 | 0.222 | 0.209 | 0.209 | 0.182 |
2 | 0 | feature10 | 0.215 | 0.213 | 0.175 | 0.196 | 0.201 |
Just an idea: Use groupby
over id
in combination with your "easy" method:
def fractions(sdf):
return sdf.apply(pd.value_counts, normalize=True).fillna(0.).T
result = df.groupby("id")[features].apply(fractions)
result.index.set_names("feature", level=1, inplace=True)
This should avoid the memory melt
-down?