Search code examples
python-3.xpandasdataframegroup-by

Pandas get most common and last value in certain category, after groupby


I have a dataframe with categorical variable and several attributes for each row:

df = ID  C1  C2  .... Cn   flag
      1  a   b         c   True
      1  a   c         d   False
      1  q   t         p   False
      1  g   y         p   False
      2  r   h         k   False

For each column, and for each value of ID, I want the most common (and num of the occurences) and the last value. I am also want to know if it has at least one 'True' in the column flag. So I will get:

out = ID c1_common c1_common_n c1_latest c2_common c2_common_n c2_latest ...  has_flag 
       1    a          2             g         y        1           y          True
       2    r          1             r         h        1           h          False

What is the best way to do so?


Solution

  • Update

    adding a new column (flag)

    out = (df.filter(regex='[^flag]').melt('ID')
             .groupby(['ID', 'variable'])
             .agg(common=('value', lambda x: x.mode().head(1)),
                  common_n=('value', lambda x: x.value_counts().head(1)),
                  latest=('value', 'last')).unstack().swaplevel(axis=1)
             .sort_index(axis=1).reset_index()
             .assign(flag=df.groupby('ID')['flag'].max().tolist()))
    
    out.columns = ['_'.join(x) if x[1] else x[0] for x in out.columns.to_flat_index()]
    print(out)
    
    # Output
       ID C1_common  C1_common_n C1_latest C2_common  C2_common_n C2_latest Cn_common  Cn_common_n Cn_latest   flag
    0   1         a            2         g         b            1         y         p            2         p   True
    1   2         r            1         r         h            1         h         k            1         k  False
    

    As your previous question, you have to flatten your dataframe before counting occurences. Finally, group by ID and Cx then aggregate some variables:

    out = (df.melt('ID').value_counts()
             .rename('count').reset_index('value')
             .groupby(level=['ID', 'variable'])
             .agg(common=('value', 'first'),
                  common_n=('count', 'first'),
                  latest=('value', 'last'))
             .unstack('variable').swaplevel(axis=1).sort_index(axis=1))
    
    out.columns = out.columns.to_flat_index().str.join('_')
    

    Output:

    >>> out
    
       C1_common  C1_common_n C1_latest C2_common  C2_common_n C2_latest Cn_common  Cn_common_n Cn_latest
    ID                                                                                                   
    1          a            2         q         b            1         y         p            2         d
    2          r            1         r         h            1         h         k            1         k