I have a dataframe with categorical variable and several attributes for each row:
df = ID C1 C2 .... Cn flag
1 a b c True
1 a c d False
1 q t p False
1 g y p False
2 r h k False
For each column, and for each value of ID, I want the most common (and num of the occurences) and the last value. I am also want to know if it has at least one 'True' in the column flag. So I will get:
out = ID c1_common c1_common_n c1_latest c2_common c2_common_n c2_latest ... has_flag
1 a 2 g y 1 y True
2 r 1 r h 1 h False
What is the best way to do so?
Update
adding a new column (flag)
out = (df.filter(regex='[^flag]').melt('ID')
.groupby(['ID', 'variable'])
.agg(common=('value', lambda x: x.mode().head(1)),
common_n=('value', lambda x: x.value_counts().head(1)),
latest=('value', 'last')).unstack().swaplevel(axis=1)
.sort_index(axis=1).reset_index()
.assign(flag=df.groupby('ID')['flag'].max().tolist()))
out.columns = ['_'.join(x) if x[1] else x[0] for x in out.columns.to_flat_index()]
print(out)
# Output
ID C1_common C1_common_n C1_latest C2_common C2_common_n C2_latest Cn_common Cn_common_n Cn_latest flag
0 1 a 2 g b 1 y p 2 p True
1 2 r 1 r h 1 h k 1 k False
As your previous question, you have to flatten your dataframe before counting occurences. Finally, group by ID and Cx then aggregate some variables:
out = (df.melt('ID').value_counts()
.rename('count').reset_index('value')
.groupby(level=['ID', 'variable'])
.agg(common=('value', 'first'),
common_n=('count', 'first'),
latest=('value', 'last'))
.unstack('variable').swaplevel(axis=1).sort_index(axis=1))
out.columns = out.columns.to_flat_index().str.join('_')
Output:
>>> out
C1_common C1_common_n C1_latest C2_common C2_common_n C2_latest Cn_common Cn_common_n Cn_latest
ID
1 a 2 q b 1 y p 2 d
2 r 1 r h 1 h k 1 k