Search code examples
pandasdataframepivot-table

Calculating different types of percentages in a 3-way crosstab with pandas


Suppose I have the results of a simple survey, where the respondents are grouped into groups A, B and C, and were asked a simple yes/no question. I have responses to this survey over 3 months (Aug, Sep and Oct 2023).

I have the data in the following format:

agg_counts=[
    {"group":'A','answer':'yes','month':'Aug 2023','count':12},
    {"group":'A','answer':'yes','month':'Sep 2023','count':5},
    {"group":'A','answer':'yes','month':'Oct 2023','count':4},
    {"group":'A','answer':'no','month':'Aug 2023','count':9},
    {"group":'A','answer':'no','month':'Sep 2023','count':10},
    {"group":'A','answer':'no','month':'Oct 2023','count':3},
    {"group":'B','answer':'yes','month':'Aug 2023','count':21},
    {"group":'B','answer':'yes','month':'Sep 2023','count':3},
    {"group":'B','answer':'yes','month':'Oct 2023','count':6},
    {"group":'B','answer':'no','month':'Aug 2023','count':8},
    {"group":'B','answer':'no','month':'Sep 2023','count':9},
    {"group":'B','answer':'no','month':'Oct 2023','count':2},
    {"group":'C','answer':'yes','month':'Aug 2023','count':11},
    {"group":'C','answer':'yes','month':'Sep 2023','count':4},
    {"group":'C','answer':'yes','month':'Oct 2023','count':9},
    {"group":'C','answer':'no','month':'Aug 2023','count':8},
    {"group":'C','answer':'no','month':'Sep 2023','count':1},
    {"group":'C','answer':'no','month':'Oct 2023','count':13}
    ]

A crosstab with the months as columns look like this:

month         Aug 2023  Oct 2023  Sep 2023
group answer                              
A     no             9         3        10
      yes           12         4         5
B     no             8         2         9
      yes           21         6         3
C     no             8        13         1
      yes           11         9         4

I would like to calculate vertical percentages within the groups (so the percentages within group A for Aug 2023 add up to 100%), vertical percentages across the groups (so the percentages in the column Aug 2023 add up to 100%), and horizontal percentages (so the percentages for each row add up to 100%). I would then want to have the figures in a DataFrame where the columns are: group, answer, month, count, and the 3 types of percentages.

I managed to do this, but I think there must be a better, more efficient and less verbose way of doing this.

The way I've done this is to create 3 crosstabs with different columns/rows/normalize options, stack them, reset the indexes and then merge them into a DataFrame:

df=pd.DataFrame(agg_counts)

vp_it=pd.crosstab(
    columns=[df['group'],df['month']],
    index=[df['answer']],
    values=df['count'],
    aggfunc=np.sum,
    normalize='columns'
    )
df_vp_it=vp_it.stack([0,1]).reset_index().rename(columns={0:"vp_inner"})

vp_ot=pd.crosstab(
    columns=[df['month']],
    index=[df['group'],df['answer']],
    values=df['count'],
    aggfunc=np.sum,
    normalize='columns'
    )
df_vp_ot=vp_ot.stack(0).reset_index().rename(columns={0:"vp_outer"})

hp=pd.crosstab(
    columns=[df['month']],
    index=[df['group'],df['answer']],
    values=df['count'],
    aggfunc=np.sum,
    normalize='index'
    )
df_hp=hp.stack([0]).reset_index().rename(columns={0:"hp"})

merge_columns=['group','answer','month']
df_merged=df.merge(df_vp_it,on=merge_columns).merge(df_vp_ot,on=merge_columns).merge(df_hp,on=merge_columns)

Is there a better way of creating the same output?


Solution

  • Transform the data into a pivot_table (what you call "cross table") in a single command:

    df = pd.DataFrame(agg_counts).pivot_table(index=['group', 'answer'], columns='month')
    

    Normalize the absolute numbers to relative percentages ...

    ... by dividing the elements of each group by each group's total (vertical group-wise normalization):

    df/df.groupby('group').sum()
    

    ... or by dividing all elements by each column's total (global vertical normalization):

    df.divide(df.sum(axis=0), axis=1)
    

    ... or by dividing all elements by each row's total (global horizontal normalization)

    df.divide(df.sum(axis=1), axis=0)