Suppose I have the results of a simple survey, where the respondents are grouped into groups A, B and C, and were asked a simple yes/no question. I have responses to this survey over 3 months (Aug, Sep and Oct 2023).
I have the data in the following format:
agg_counts=[
{"group":'A','answer':'yes','month':'Aug 2023','count':12},
{"group":'A','answer':'yes','month':'Sep 2023','count':5},
{"group":'A','answer':'yes','month':'Oct 2023','count':4},
{"group":'A','answer':'no','month':'Aug 2023','count':9},
{"group":'A','answer':'no','month':'Sep 2023','count':10},
{"group":'A','answer':'no','month':'Oct 2023','count':3},
{"group":'B','answer':'yes','month':'Aug 2023','count':21},
{"group":'B','answer':'yes','month':'Sep 2023','count':3},
{"group":'B','answer':'yes','month':'Oct 2023','count':6},
{"group":'B','answer':'no','month':'Aug 2023','count':8},
{"group":'B','answer':'no','month':'Sep 2023','count':9},
{"group":'B','answer':'no','month':'Oct 2023','count':2},
{"group":'C','answer':'yes','month':'Aug 2023','count':11},
{"group":'C','answer':'yes','month':'Sep 2023','count':4},
{"group":'C','answer':'yes','month':'Oct 2023','count':9},
{"group":'C','answer':'no','month':'Aug 2023','count':8},
{"group":'C','answer':'no','month':'Sep 2023','count':1},
{"group":'C','answer':'no','month':'Oct 2023','count':13}
]
A crosstab with the months as columns look like this:
month Aug 2023 Oct 2023 Sep 2023
group answer
A no 9 3 10
yes 12 4 5
B no 8 2 9
yes 21 6 3
C no 8 13 1
yes 11 9 4
I would like to calculate vertical percentages within the groups (so the percentages within group A for Aug 2023 add up to 100%), vertical percentages across the groups (so the percentages in the column Aug 2023 add up to 100%), and horizontal percentages (so the percentages for each row add up to 100%). I would then want to have the figures in a DataFrame where the columns are: group, answer, month, count, and the 3 types of percentages.
I managed to do this, but I think there must be a better, more efficient and less verbose way of doing this.
The way I've done this is to create 3 crosstabs with different columns/rows/normalize options, stack them, reset the indexes and then merge them into a DataFrame:
df=pd.DataFrame(agg_counts)
vp_it=pd.crosstab(
columns=[df['group'],df['month']],
index=[df['answer']],
values=df['count'],
aggfunc=np.sum,
normalize='columns'
)
df_vp_it=vp_it.stack([0,1]).reset_index().rename(columns={0:"vp_inner"})
vp_ot=pd.crosstab(
columns=[df['month']],
index=[df['group'],df['answer']],
values=df['count'],
aggfunc=np.sum,
normalize='columns'
)
df_vp_ot=vp_ot.stack(0).reset_index().rename(columns={0:"vp_outer"})
hp=pd.crosstab(
columns=[df['month']],
index=[df['group'],df['answer']],
values=df['count'],
aggfunc=np.sum,
normalize='index'
)
df_hp=hp.stack([0]).reset_index().rename(columns={0:"hp"})
merge_columns=['group','answer','month']
df_merged=df.merge(df_vp_it,on=merge_columns).merge(df_vp_ot,on=merge_columns).merge(df_hp,on=merge_columns)
Is there a better way of creating the same output?
Transform the data into a pivot_table (what you call "cross table") in a single command:
df = pd.DataFrame(agg_counts).pivot_table(index=['group', 'answer'], columns='month')
Normalize the absolute numbers to relative percentages ...
... by dividing the elements of each group by each group's total (vertical group-wise normalization):
df/df.groupby('group').sum()
... or by dividing all elements by each column's total (global vertical normalization):
df.divide(df.sum(axis=0), axis=1)
... or by dividing all elements by each row's total (global horizontal normalization)
df.divide(df.sum(axis=1), axis=0)