python pandas dataframe variance

Calculate aggregated variance for each group in Python


I have a data frame (df) with these columns: user, vector, and group.

import pandas as pd

df = pd.DataFrame({
    'user': ['user_1', 'user_2', 'user_3', 'user_4', 'user_5', 'user_6'],
    'vector': [[1, 0, 2, 0], [1, 8, 0, 2], [6, 2, 0, 0],
               [5, 0, 2, 2], [3, 8, 0, 0], [6, 0, 0, 2]],
    'group': ['A', 'B', 'C', 'B', 'A', 'A'],
})

I want to calculate the aggregated variance for each group, that is, the variance of all vector elements pooled across the users in that group.

I tried this code, but it returns an error:

aggregated_variance = (df.groupby('group', as_index=False)['vector'].agg(["var"]))

ValueError: no results


Solution

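  • The aggregation fails because each cell of vector holds a whole list, which var cannot reduce to a number. You can use .explode to flatten the lists into one row per element and then perform the .groupby operation on the result: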

    out = (
        df.explode('vector')
        .groupby('group')['vector'].var(ddof=1)
    )
    
    print(out)
    group
    A    7.060606
    B    7.428571
    C    8.000000
    Name: vector, dtype: float64
    

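    The trick here lies in the use of .explode, which gives every list element its own row and repeats the index, user, and group values alongside it: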

    >>> df.head()
         user        vector group
    0  user_1  [1, 0, 2, 0]     A
    1  user_2  [1, 8, 0, 2]     B
    2  user_3  [6, 2, 0, 0]     C
    3  user_4  [5, 0, 2, 2]     B
    4  user_5  [3, 8, 0, 0]     A
    
    >>> df.explode('vector').head()
         user vector group
    0  user_1      1     A
    0  user_1      0     A
    0  user_1      2     A
    0  user_1      0     A
    1  user_2      1     B
    ...
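
One detail worth guarding against: .explode leaves the vector column with object dtype, and how groupby reductions treat object columns can vary between pandas versions. The following is a minimal sketch, not part of the original answer, that casts the exploded column to float64 before aggregating and cross-checks the result against NumPy's sample variance on the hand-pooled lists. The names exploded and expected are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'user': ['user_1', 'user_2', 'user_3', 'user_4', 'user_5', 'user_6'],
    'vector': [[1, 0, 2, 0], [1, 8, 0, 2], [6, 2, 0, 0],
               [5, 0, 2, 2], [3, 8, 0, 0], [6, 0, 0, 2]],
    'group': ['A', 'B', 'C', 'B', 'A', 'A'],
})

# explode() leaves 'vector' as object dtype, so cast it before aggregating.
exploded = df.explode('vector').astype({'vector': 'float64'})
out = exploded.groupby('group')['vector'].var(ddof=1)

# Cross-check: pool each group's lists by hand and take NumPy's
# sample variance (ddof=1) over the concatenated values.
expected = {
    group: np.var(np.concatenate(vectors.tolist()), ddof=1)
    for group, vectors in df.groupby('group')['vector']
}

print(out)
print(expected)
```

Both routes should agree with the figures printed above; the explicit cast simply makes the numeric intent visible instead of relying on how pandas handles an object column.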