
More efficient way for: value_counts (in %) for many columns grouped by a variable


I am working on a use case where I have many observations of many features per id, and I need to count the frequency of the values (~5 discrete values) for every id and column.

I have a solution that works for a fairly small dataset (e.g. <1 million rows) but fails on my whole dataset (which might grow even larger in the future) because it fills my RAM. I could not find a clean groupby solution, since I need to do this for many columns at once.

Sample data:

import pandas as pd
import numpy as np

n = 100     # Number of features
m = 100     # Number of classes
k = 10000   # Number of occurrences per class

possible_values = [1, 2, 3, 4, 5]

df = []
for i in range(m):
    for j in range(k):
        df.append( np.append(i, np.random.choice(possible_values, n)) )

features = [f"feature{i}" for i in range(n)]
df = pd.DataFrame(df, columns=(["id"] + features))

Without the groupby, it is easy enough:

df[features].apply(pd.value_counts).T / df.shape[0]
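
Equivalently (assuming there are no missing values, so both denominators agree), the normalization can be pushed into value_counts itself:

df[features].apply(pd.value_counts, normalize=True).T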

My approach

melted = df.melt(id_vars="id", var_name='feature', value_name='value')
feature_freq_id = pd.crosstab(index=[melted.id, melted.feature], columns=melted.value).reset_index()
feature_freq_id[possible_values] = feature_freq_id[possible_values].div(feature_freq_id[possible_values].sum(axis=1), axis=0)

The problem is that melted has n*m*k rows. My dataset has >250 features, >200 ids and ~5k observations per id, which means melted will have >250 million rows. This eventually fills my memory and Python dies.
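
A rough back-of-the-envelope estimate (the 8 bytes per cell is an assumption, and the string overhead of the feature column is ignored) shows the scale of the problem:

n_features = 250     # >250 features
n_ids = 200          # >200 ids
obs_per_id = 5_000   # ~5k observations per id

rows_melted = n_features * n_ids * obs_per_id   # 250,000,000 rows
approx_gb = rows_melted * 3 * 8 / 1e9           # 3 columns, ~8 bytes per cell
print(f"{rows_melted:,} rows, roughly {approx_gb:.0f} GB as a lower bound")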

Expected result:

feature_freq_id.head(3)
   id    feature      1      2      3      4      5
0   0   feature0  0.183  0.185  0.226  0.187  0.219
1   0   feature1  0.178  0.222  0.209  0.209  0.182
2   0  feature10  0.215  0.213  0.175  0.196  0.201

Solution

  • Just an idea: Use groupby over id in combination with your "easy" method:

    def fractions(sdf):
        # Per-id share of each possible value in every feature column;
        # fillna(0.) covers values that never occur for that id.
        return sdf.apply(pd.value_counts, normalize=True).fillna(0.).T

    result = df.groupby("id")[features].apply(fractions)
    result.index.set_names("feature", level=1, inplace=True)
    

    This should avoid the memory melt-down?
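
    To get the same layout as the expected output above (plain id and feature columns next to the value columns), a final reset_index() should do, for example:

    feature_freq_id = result.reset_index()
    feature_freq_id.head(3)

    Since groupby("id").apply only ever works on one id's block (~5k rows by ~250 columns) at a time, the >250-million-row melted frame is never built.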