I have the following dataframe:
d = pd.DataFrame({'UNIQUE_KEY': [1, 2, 3, 4], 'TRANSFORMATION': ['P', 'D', 'N', 'P'],
'DIM_1': ['Y', 'N', 'N', 'Y'], 'DIM_2': ['N', 'N', 'N', 'Y'], 'DIM_3': ['Y', 'Y', 'N', 'Y']})
UNIQUE_KEY TRANSFORMATION DIM_1 DIM_2 DIM_3
0 1 P Y N Y
1 2 D N N Y
2 3 N N N N
3 4 P Y Y Y
I want to perform several groupby
and aggregate operations in order to get the following output dataframe:
DIM DIM_VALUE TTL_CASES % CASES % D % N % P
0 DIM_1 'Y' 2 50 0 0 100
1 DIM_1 'N' 2 50 50 50 0
2 DIM_2 'Y' 1 25 0 0 100
3 DIM_2 'N' 3 75 33.3 33.3 33.3
4 DIM_3 'Y' 3 75 33.3 0 66.6
5 DIM_3 'N' 1 25 0 100 0
Where
DIM
is a column with each of DIM_1,2,3
DIM_VALUE
is a grouped column based on the values of each DIM_1,2,3
TTL_CASES
is a column with the count of UNIQUE_KEY
grouped by DIM
and DIM_1,2,3
PCT_CASES
is the percentage of each row of TTL_CASES
%D
, %P
, %N
are the percentages of TRANSFORMATION
of UNIQUE_KEY
based on the the grouped by DIM
and DIM_1,2,3
What I have is the following:
P = d.groupby('TRANSFORMATION')['UNIQUE_KEY'].count().reset_index()
P['Percentage'] = 100 * P['UNIQUE_KEY'] / P['UNIQUE_KEY'].sum()
which gives me the percentage of each value in TRANFORMATION
but how do I do this for each dimension and get an output dataframe in the format I want?
Thanks in advance!
Seems like there is no easy way to achieve the desired output, you need to break the problem down into multiple steps: First melt the dataframe to convert DIM like columns to rows, then create two grouper objects, calculate the aggregations using the grouper objects as required and assign the results to output dataframe
s = d.melt(['UNIQUE_KEY', 'TRANSFORMATION'], var_name='DIM', value_name='DIM_VALUE')
g1 = s.groupby(['DIM'])
g2 = s.groupby(['DIM', 'DIM_VALUE'])
out = pd.DataFrame()
out['TTL_CASES'] = g2['UNIQUE_KEY'].count()
out['%_TTL_CASES'] = out['TTL_CASES'].div(g1['UNIQUE_KEY'].count()).mul(100)
out = out.assign(**g2['TRANSFORMATION'].value_counts(normalize=True).mul(100).unstack(fill_value=0).add_prefix('%'))
Result
DIM DIM_VALUE TTL_CASES %_TTL_CASES %D %N %P
0 DIM_1 N 2 50.0 50.000000 50.000000 0.000000
1 DIM_1 Y 2 50.0 0.000000 0.000000 100.000000
2 DIM_2 N 3 75.0 33.333333 33.333333 33.333333
3 DIM_2 Y 1 25.0 0.000000 0.000000 100.000000
4 DIM_3 N 1 25.0 0.000000 100.000000 0.000000
5 DIM_3 Y 3 75.0 33.333333 0.000000 66.666667