Search code examples
pandasfrequencypercentagepandas-groupbymulti-level

Multi-level groupby sub-population percentages


Let's consider the following dataframe:

df = {'Location': ['A','A','B','B','C','C','A','C','A'],
'Gender'['M','M','F','M','M','F','M','M','M'],
'Edu'['N','N','Y','Y','Y','N','Y','Y','Y'],
'Access1': [1,0,1,0,1,0,1,1,1], 'Access2': [1,1,1,0,0,1,0,0,1] }
df = pd.DataFrame(data=d, dtype=np.int8)

Output from dataframe:

   Access1  Access2 Edu Gender Location
0        1        1   N      M        A
1        0        1   N      M        A
2        1        1   Y      F        B
3        0        0   Y      M        B
4        1        0   Y      M        C
5        0        1   N      F        C
6        1        0   Y      M        A
7        1        0   Y      M        C
8        1        1   Y      M        A

Then I am using groupby to analyse the frequencies in df

D0=df.groupby(['Location','Gender','Edu']).sum()
((D0/ D0.groupby(level = [0]).transform(sum))*100).round(3).astype(str) + '%'

Output:

                     Access1  Access2
Location Gender Edu                  
A        M      N    33.333%  66.667%
                Y    66.667%  33.333%
B        F      Y     100.0%   100.0%
         M      Y       0.0%     0.0%
C        F      N       0.0%   100.0%
         M      Y     100.0%     0.0%

From this output, I infer that 33.3% of uneducated men in location A with Access to service 1 (=Access1) is the result of considering 3 people in location A having access to service 1, of which 1 uneducated man has access to it (=1/3).

Yet, wish to get a different output. I would like to consider a total of 4 men in location A as my 100%. 50% of this group of men are uneducated. Out of that 50% of uneducated men, 25% have access to service 1. So, the percentage I would like to see in the table is 25% (total of uneducated men in area A accessing service 1). Is groupby the right way to get there, and what would be the best way to measure the % of Access to service 1 while considering a disaggregation from the total population of reference per location?


Solution

  • I believe need divide D0 by first level of MultiIndex mapped by a Series:

    D0=df.groupby(['Location','Gender','Edu']).sum()
    
    a = df['Location'].value_counts()
    #alternative
    #a = df.groupby(['Location']).size()
    print (a)
    A    4
    C    3
    B    2
    Name: Location, dtype: int64
    
    df1 = D0.div(D0.index.get_level_values(0).map(a.get), axis=0)
    print (df1)
                          Access1   Access2
    Location Gender Edu                    
    A        M      N    0.250000  0.500000
                    Y    0.500000  0.250000
    B        F      Y    0.500000  0.500000
             M      Y    0.000000  0.000000
    C        F      N    0.000000  0.333333
             M      Y    0.666667  0.000000
    

    Detail:

    print (D0.index.get_level_values(0).map(a.get))
    Int64Index([4, 4, 2, 2, 3, 3], dtype='int64', name='Location')