I'm considering a multiindexed (i,j,k)
DataFrame with one column that contains three categorical variables A
, B
or C
.
I want to compute the frequency of the variables for each i
over all (j,k)
. I have a solution, but I think there exists a more pythonic and efficient way of doing it.
The code for a MWE reads (in reality len(I)*len(J)*len(K)
is large, in the millions, say):
import pandas as pd
import numpy as np
I = range(10)
J = range(3)
K = range(2)
data = pd.DataFrame(
np.random.randint(0, 3, size=len(I)*len(J)*len(K)),
index=pd.MultiIndex.from_product([I, J, K]),
columns=['cat']
)
data.index.names = ['i', 'j', 'k']
data[data['cat'] == 0] = 'A'
data[data['cat'] == 1] = 'B'
data[data['cat'] == 2] = 'C'
data = data.unstack(['j', 'k'])
result = data.apply(lambda x: x.value_counts(), axis=1).fillna(0) / (len(J)*len(K))
You could use groupby
, and also normalize your value_counts
:
data.groupby(level=0)['cat'].value_counts(normalize=True).unstack(level=1).fillna(0)
To compare, first let's make the dummy data big (60 million rows):
import pandas as pd
import numpy as np
I = range(100000)
J = range(30)
K = range(20)
data = pd.DataFrame(
np.random.randint(0, 3, size=len(I)*len(J)*len(K)),
index=pd.MultiIndex.from_product([I, J, K]),
columns=['cat']
)
data.index.names = ['i', 'j', 'k']
data[data['cat'] == 0] = 'A'
data[data['cat'] == 1] = 'B'
data[data['cat'] == 2] = 'C'
Timing your original method:
data_interim = data.unstack(['j', 'k'])
data_interim.apply(lambda x: x.value_counts(), axis=1).fillna(0) / (len(J)*len(K))
gives (on my machine) 1min 24s ± 1.98 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
the alternative:
data.groupby(level=0)['cat'].value_counts(normalize=True).unstack(level=1).fillna(0)
gives (on my machine) 8.86 s ± 216 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)