Search code examples
pythonpandasdataframegroup-bypandas-explode

Pandas groupby mean and value_counts but using another column of counts


I need a memory-efficient way of exploding a dataframe and then grouping by a column to get either normalized value_counts or the mean depending on dtype.

I have a dataframe similar to this:

   key  cnt X1  X2
0    1    8  a   1
1    1    3  b   0
2    1    4  a   0
3    2    2  b   1
4    2    6  a   0
5    3    3  a   1

The cnt variable indicates the count of the values in other columns. For example, in the first row you can interpret X1 has having 8 instances of 'a' (i.e., like ['a']*8 or ['a','a','a','a','a','a','a','a']).

I am able to do what I need using .apply() then .explode() then .groupby(), something like this:

df = pd.DataFrame([[1, 8, 'a', 1], [1, 3, 'b', 0], [1, 4, 'a', 0],
                   [2, 2, 'b', 1], [2, 6, 'a', 0], [3, 3, 'a', 1]],
                  columns=['key', 'cnt', 'X1', 'X2'])

df['X1'] = df.apply(lambda row: [row['X1']] * row['cnt'], axis=1)
df['X2'] = df.apply(lambda row: [row['X2']] * row['cnt'], axis=1)
df = df.explode(['X1', 'X2']).drop(columns=['cnt']).reset_index(drop=True)

vc = pd.DataFrame(df.groupby(['key'])['X1'].value_counts(normalize=True, dropna=True)).unstack()
vc.columns = [f'X1_{col}' for col in vc.columns.get_level_values(1).values]

df = pd.merge(left=vc.reset_index(),
              right=df.drop(columns=['X1']).groupby(['key']).agg('mean').astype(float).reset_index(),
              how='left')
print(df)

   key  X1_a  X1_b        X2
0    1  0.80  0.20  0.533333
1    2  0.75  0.25  0.250000
2    3  1.00   NaN  1.000000

But the data I'm working with is enormous, with many different variables that need to be aggregated like this, and most of the cnt values are >15000, which results in using too much memory and freezing my machine.

I feel like there must be a more memory-efficient way to do this. Anyone have any ideas?


Solution

  • There's a better way

    Calculate the normalized sum of counts per key and X1

    cnts = pd.crosstab(df['key'], df['X1'], df['cnt'], 
                       aggfunc='sum', normalize='index').add_prefix('X1_')
    
    # X1   X1_a  X1_b
    # key            
    # 1    0.80  0.20
    # 2    0.75  0.25
    # 3    1.00  0.00
    

    Calculated the weighted average of X2 per key

    weighted_sum = df['cnt'].mul(df['X2']).groupby(df['key']).sum()
    total_weight = df.groupby('key')['cnt'].sum()
    average = weighted_sum / total_weight
    
    # key
    # 1    0.533333
    # 2    0.250000
    # 3    1.000000
    # dtype: float64
    

    Join the dataframes

    result = cnts.join(average.rename('X2'))
    
    #      X1_a  X1_b        X2
    # key                      
    # 1    0.80  0.20  0.533333
    # 2    0.75  0.25  0.250000
    # 3    1.00  0.00  1.000000