Search code examples
pythonpandasgroup-bypivot-table

How to calculate percentage of column using pandas pivot_table()


I am attempting to get the frequency of objects per version expressed as a percentage.

Input dfsof = pd.read_clipboard()

file     version     object
path1    1.0         name
path1    1.0         session
path1    1.0         sequence
path2    2.01        name
path2    2.01        session
path2    2.01        sequence
path3    2.01        name
path3    2.01        session
path3    2.01        earthworm

Using the following, I am able to get frequency of each file.

dfsof.pivot_table(index=['object'], values=['file'], columns=['version'], aggfunc=len, fill_value=0, margins=True)

             file
version      1.0    2.01    All
object          
 earthworm   0      1       1
 name        1      2       3
 sequence    1      1       2
 session     1      2       3
All          3      6       9

I want to divide each count per object/version by the total number of distinct files for that version. Using the expected return table as an example, earthworms shows up in the input only once for version 2.01, so I expect 0% for version 1.0 and 50% for version 2.01 since only one of the files has that value.

Using dfsof.groupby('version')['file'].nunique() returns the frequency of files per version, which is the denominator for each of object/version in the table above. What I am struggling with is how to apply the denominator values to the pivot_table. I have seen examples of this using grand totals and subtotals but I can't seem to find nor figure out how to divide by the unique number of files per version. Any help would be greatly appreciated.

version
1.00    1
2.01    2

Expected return

             path
version      1.0    2.01    All
object          
 earthworm   0%     50%     1
 name        100%   100%    3
 sequence    100%   50%     2
 session     100%   100%    3
All          3      6       9

Solution

  • IIUC, you need to aggregate as 'nunique', then perform some recomputation:

    # aggregate using nunique
    out = dfsof.pivot_table(index=['object'], values=['file'], columns=['version'],
                            aggfunc='nunique', fill_value=0, margins=True)
    
    # compute and save sum of nunique
    All = out.iloc[:-1].sum()
    # update values as percentage
    out.update(out.iloc[:-1, :-1].div(out.iloc[-1, :-1]).mul(100))
    
    # assign sum of nunique
    out.loc['All'] = All
    
    print(out)
    

    Output:

                file           
    version      1.0   2.01 All
    object                     
    earthworm    0.0   50.0   1
    name       100.0  100.0   3
    sequence   100.0   50.0   2
    session    100.0  100.0   3
    All          3.0    6.0   9