
More efficient way for: value_counts (in %) for many columns grouped by a variable


I am working on a use case where I have many observations of many features per id, and I need to count the frequency of the values (~5 discrete values) for every id and column.

I have a solution that works for a fairly small dataset (e.g. <1 million rows) but fails on my whole dataset (which might grow even larger in the future) because it fills my RAM. I could not find a clean groupby solution, since I need to do this for many columns at once.

Sample data:

import pandas as pd
import numpy as np

n = 100     # Number of features
m = 100     # Number of classes
k = 10000   # Number of occurrences per class

possible_values = [1, 2, 3, 4, 5]

df = []
for i in range(m):
    for j in range(k):
        df.append( np.append(i, np.random.choice(possible_values, n)) )

features = [f"feature{i}" for i in range(n)]
df = pd.DataFrame(df, columns=(["id"] + features))

Without the groupby, it is easy enough:

df[features].apply(pd.value_counts).T / df.shape[0]
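
Equivalently (assuming there are no missing values, so both denominators agree), the normalization can be pushed into value_counts itself:

df[features].apply(pd.value_counts, normalize=True).T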

My approach

melted = df.melt(id_vars="id", var_name='feature', value_name='value')
feature_freq_id = pd.crosstab(index=[melted.id, melted.feature], columns=melted.value).reset_index()
feature_freq_id[possible_values] = feature_freq_id[possible_values].div(feature_freq_id[possible_values].sum(axis=1), axis=0)

The problem is that melted has n*m*k rows. My dataset has >250 features, >200 ids and ~5k observations per id, which means melted will have >250 million rows. This eventually fills my memory and Python dies.
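
A rough back-of-the-envelope estimate (the 8 bytes per cell is an assumption, and the string overhead of the feature column is ignored) shows the scale of the problem:

n_features = 250     # >250 features
n_ids = 200          # >200 ids
obs_per_id = 5_000   # ~5k observations per id

rows_melted = n_features * n_ids * obs_per_id   # 250,000,000 rows
approx_gb = rows_melted * 3 * 8 / 1e9           # 3 columns, ~8 bytes per cell
print(f"{rows_melted:,} rows, roughly {approx_gb:.0f} GB as a lower bound")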

Expected result:

feature_freq_id.head(3)
   id    feature      1      2      3      4      5
0   0   feature0  0.183  0.185  0.226  0.187  0.219
1   0   feature1  0.178  0.222  0.209  0.209  0.182
2   0  feature10  0.215  0.213  0.175  0.196  0.201

Solution

  • Just an idea: Use groupby over id in combination with your "easy" method:

    def fractions(sdf):
        # Per-id share of each possible value in every feature column;
        # fillna(0.) covers values that never occur for that id.
        return sdf.apply(pd.value_counts, normalize=True).fillna(0.).T

    result = df.groupby("id")[features].apply(fractions)
    result.index.set_names("feature", level=1, inplace=True)
    

    This should avoid the memory melt-down?
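
    To get the same layout as the expected output above (plain id and feature columns next to the value columns), a final reset_index() should do, for example:

    feature_freq_id = result.reset_index()
    feature_freq_id.head(3)

    Since groupby("id").apply only ever works on one id's block (~5k rows by ~250 columns) at a time, the >250-million-row melted frame is never built.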