Tags: python-3.x, pandas, crosstab

pandas crosstab and normalising indexes


How do you normalize a pandas crosstab that has a column MultiIndex?

Suppose you have a df like this:

# RANDOM DATA
import random

import numpy as np
import pandas as pd

np.random.seed(2)

Year = [2020,2019,2018,2017]*1000
col1 = ['A','B','C','D']*1000
col2 = np.random.randint(0,2,4000)
weight = np.random.randint(1,3,4000)

random.shuffle(Year)
random.shuffle(col1)
random.shuffle(col2)
random.shuffle(weight)

column_names = ['Year', 'weight', 'col1', 'col2']

df = pd.DataFrame(columns=column_names)

df['Year'] = Year
df['col1'] = col1
df['col2'] = col2
df['weight'] = weight

And now you build a crosstab from it, normalized across each row:

pd.crosstab(index=df['Year'],
            columns=[df['col1'], df['col2']],
            values=df['weight'],
            aggfunc='sum',
            normalize='index')

col1         A                   B                   C                   D
col2         0         1         0         1         0         1         0         1
Year
2017  0.117962  0.128686  0.128016  0.130697  0.137399  0.122654  0.115282  0.119303
2018  0.116832  0.111551  0.120132  0.118152  0.138614  0.125413  0.131353  0.137954
2019  0.137584  0.126846  0.127517  0.108725  0.114765  0.138255  0.114765  0.131544
2020  0.116356  0.134309  0.113032  0.143617  0.121676  0.118351  0.121676  0.130984

How do you normalize within the column MultiIndex instead, so that the two col2 values sum to 1 within each col1 group?

My expected output would be:

col1    A   A   B   B   C   C   D   D
col2    0   1   0   1   0   1   0   1
Year                                
2017    0.478   0.522   0.495   0.505   0.528   0.472   0.491   0.509
2018    0.512   0.488   0.504   0.496   0.525   0.475   0.488   0.512
2019    0.520   0.480   0.540   0.460   0.454   0.546   0.466   0.534
2020    0.464   0.536   0.440   0.560   0.507   0.493   0.482   0.518

Solution

  • I think you can use groupby().value_counts() with normalize=True:

    #### Random data ####
    import numpy as np
    import pandas as pd

    np.random.seed(2)
    Year = [2020,2019,2018,2017]*1000
    col1 = np.repeat(['A','B','C','D'],1000)
    col2 = np.random.randint(0,2,4000)
    weight = np.random.randint(1,3,4000)   # kept from the question for the weighted update below

    column_names = ['Year', 'col1', 'col2', 'weight']

    df = pd.DataFrame(columns=column_names)

    df['Year'] = Year
    df['col1'] = col1
    df['col2'] = col2
    df['weight'] = weight
    #### End random data ####
    
    
    (df.groupby(['Year','col1'])['col2']
       .value_counts(normalize=True)
       .unstack(['col1','col2'])
    )
    

    Output:

    col1      A             B             C             D       
    col2      0      1      1      0      0      1      1      0
    Year                                                        
    2017  0.524  0.476  0.528  0.472  0.540  0.460  0.512  0.488
    2018  0.496  0.504  0.484  0.516  0.560  0.440  0.524  0.476
    2019  0.496  0.504  0.552  0.448  0.484  0.516  0.508  0.492
    2020  0.500  0.500  0.540  0.460  0.436  0.564  0.484  0.516
    
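    To reproduce the column order and rounding of the expected output, you can sort the column MultiIndex and round; a minimal follow-up on the same df:

    (df.groupby(['Year','col1'])['col2']
       .value_counts(normalize=True)
       .unstack(['col1','col2'])
       .sort_index(axis=1)   # order columns A/0, A/1, B/0, B/1, ...
       .round(3)
    )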

    Update: For a weighted version, summing the weight column instead of counting rows:

    new_df = df.groupby(['Year','col1','col2'])['weight'].sum()
    # Series.sum(level=...) was removed in pandas 2.0, so divide by a
    # groupby transform over the same levels instead:
    new_df /= new_df.groupby(level=['Year','col1']).transform('sum')
    new_df = new_df.unstack(['col1','col2'])
    

    Output:

    col1         A                   B                   C                   D
    col2         0         1         0         1         0         1         0         1
    Year
    2017  0.441176  0.558824  0.542500  0.457500  0.474286  0.525714  0.482857  0.517143
    2018  0.500000  0.500000  0.474114  0.525886  0.481268  0.518732  0.489461  0.510539
    2019  0.521622  0.478378  0.474490  0.525510  0.498801  0.501199  0.531343  0.468657
    2020  0.493671  0.506329  0.479339  0.520661  0.515789  0.484211  0.513021  0.486979
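
    For completeness, a sketch of the same normalization applied directly to the crosstab from the question (assuming the df with its weight column): transpose so col1 becomes an index level, divide each cell by its col1 group total, and transpose back.

    ct = pd.crosstab(index=df['Year'],
                     columns=[df['col1'], df['col2']],
                     values=df['weight'],
                     aggfunc='sum')

    # Each (Year, col1) pair of cells is divided by that pair's total,
    # so the two col2 values sum to 1 within every col1 group.
    normalized = (ct.T / ct.T.groupby(level='col1').transform('sum')).T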