Search code examples
pythonpandasfrequency

Efficient calculation of frequencies of multiindexed category variables


I'm considering a multiindexed (i,j,k) DataFrame with one column that contains three categorical variables A, B or C.

I want to compute the frequency of the variables for each i over all (j,k). I have a solution, but I think there exists a more pythonic and efficient way of doing it.

The code for a MWE reads (in reality len(I)*len(J)*len(K) is large, in the millions, say):

import pandas as pd
import numpy as np

I = range(10)
J = range(3)
K = range(2)

data = pd.DataFrame(
    np.random.randint(0, 3, size=len(I)*len(J)*len(K)),
    index=pd.MultiIndex.from_product([I, J, K]),
    columns=['cat']
)

data.index.names = ['i', 'j', 'k']

data[data['cat'] == 0] = 'A'
data[data['cat'] == 1] = 'B'
data[data['cat'] == 2] = 'C'


data = data.unstack(['j', 'k'])
result = data.apply(lambda x: x.value_counts(), axis=1).fillna(0) / (len(J)*len(K))

Solution

  • You could use groupby, and also normalize your value_counts:

    data.groupby(level=0)['cat'].value_counts(normalize=True).unstack(level=1).fillna(0)
    

    To compare, first let's make the dummy data big (60 million rows):

    import pandas as pd
    import numpy as np
    
    I = range(100000)
    J = range(30)
    K = range(20)
    
    data = pd.DataFrame(
        np.random.randint(0, 3, size=len(I)*len(J)*len(K)),
        index=pd.MultiIndex.from_product([I, J, K]),
        columns=['cat']
    )
    
    data.index.names = ['i', 'j', 'k']
    
    data[data['cat'] == 0] = 'A'
    data[data['cat'] == 1] = 'B'
    data[data['cat'] == 2] = 'C'
    

    Timing your original method:

    data_interim = data.unstack(['j', 'k'])
    data_interim.apply(lambda x: x.value_counts(), axis=1).fillna(0) / (len(J)*len(K))
    

    gives (on my machine) 1min 24s ± 1.98 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

    the alternative:

    data.groupby(level=0)['cat'].value_counts(normalize=True).unstack(level=1).fillna(0)
    

    gives (on my machine) 8.86 s ± 216 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)