Search code examples
pandasgroup-by

filling a data frame with N count for two distinct categories


sample data:

type1 = ['bunny','cow','pig','chicken','elephant']
type2 = ['flower','tree','grass']

data_type1 = np.random.choice(a = type1, p = [0.05,0.15,0.25,0.05,0.5], size=1000)
data_type2 = np.random.choice(a = type2, p = [0.25,0.25,0.50], size=1000)
data_value = stats.poisson.rvs(loc = 18, mu = 30, size = 1000)

df = pd.DataFrame({'type1':data_type1,'type2':data_type2,'data':data_value})

grp1 = pd.unique(df.type1.values)
grp1.sort()
grp2 = pd.unique(df.type2.values)
grp2.sort()
m_df = pd.DataFrame(index = grp1,columns=grp2)

given a df structured as 'df', how would I populate 'm_df' with the count of each type combo. I want to accomplish df.groupby(['type1','type2'])['data'].count() but I'm not sure how to write it into a df so it looks nice and is more useful.

Edit: below is a simpler df that also works as an example. The expected output is the number of values for each type combination -- the exact output shown when the input is df.groupby(['type1','type2'])['data'].count(). The question is how to represent that output as a dataframe resembling m_df.

df = pd.DataFrame({'type1': ['bunny','cow','pig','chicken','elephant','cow','pig'],
           'data': [32,23,45,35,20,28,55],
           'type2':['female', 'male','male','male','male','female','female']})

Solution

  • IIUC, use value_counts (same as groupby_count) then unstack the second index level:

    >>> (df.value_counts(['type1', 'type2'])
           .unstack('type2', fill_value=0)
           .rename_axis(index=None, columns=None))
    
              female  male
    bunny          1     0
    chicken        0     1
    cow            1     1
    elephant       0     1
    pig            1     1
    

    You can also use pd.crosstab or pivot_table:

    >>> (pd.crosstab(df['type1'], df['type2'], 1, aggfunc='count')
       .fillna(0).astype(int).rename_axis(index=None, columns=None))
    
    >>> (df.assign(val=1).pivot_table('val', 'type1', 'type2', aggfunc=sum, fill_value=0)
       .rename_axis(index=None, columns=None))