Search code examples
pythonpandasdataframenumpypandas-groupby

pandas crosstab simplified view of multiple columns


I already referred the posts here and here. Don't mark it as duplicate

I have a dataframe like as below

id,status,country,amount,qty
1,pass,USA,123,4500
1,pass,USA,156,3210
1,fail,UK,687,2137
1,fail,UK,456,1236
2,pass,AUS,216,324
2,pass,AUS,678,241
2,nan,ANZ,637,213
2,pass,ANZ,213,543

sf = pd.read_clipboard(sep=',')

I would like to get the percentage of values from each column as a seperate column

So, with the help of this post, I tried the below

Approach - 1 Doesn't give expected output shape

(pd.crosstab(sf['id'],[sf['status'].fillna('nan'),sf['country'].fillna('nan')],normalize=0)
          .drop('nan', 1)
          .mul(100)).reset_index()

Approach - 2 - Doesn't give expected output

sf_inv= sf.melt()
pd.crosstab(sf_inv.value, sf_inv.variable)

I expect my output to be like as below

enter image description here


Solution

  • You can use crosstab with normalize='index' on your different columns and concat the results:

    pd.concat([pd.crosstab(sf['id'], sf[c], normalize='index')
               for c in ['status', 'country']], axis=1).mul(100).add_suffix('_pct')
    

    output:

        fail_pct  pass_pct  ANZ_pct  AUS_pct  UK_pct  USA_pct
    id                                                       
    1       50.0      50.0      0.0      0.0    50.0     50.0
    2        0.0     100.0     50.0     50.0     0.0      0.0
    
    handling NaNs:
    pd.concat([pd.crosstab(sf['id'], sf[c].fillna('NA'), normalize='index')
                 .drop(columns='NA', errors='ignore')
               for c in ['status', 'country']], axis=1).mul(100).add_suffix('_pct')
    

    output:

        fail_pct  pass_pct  ANZ_pct  AUS_pct  UK_pct  USA_pct
    id                                                       
    1       50.0      50.0      0.0      0.0    50.0     50.0
    2        0.0      75.0     50.0     50.0     0.0      0.0