Search code examples
pythonpandasdataframedata-analysis

How do I get the value of the column with the max data in a group by in Pandas?


I have a dataframe which has multiple rows for each particular user_id and I'm trying to create a new dataframe that aggregates the values for each user_id. For example, the original dataframe df is

id    user_id      eventdate      col1       col2
1      100         2015-10-01      2          A
2      200         2015-11-01      4          B
3      300         2015-12-01      1          A
4      100         2016-02-01      3          B
5      300         2016-03-01      6          C

From this I want a new dataframe df_agg for each user like this:

user_id      col1_sum        col2_latest
100          5               B
200          4               B
300          7               C

I got the col1_sum by doing a groupby like this:

df_agg = pd.DataFrame({'col1_sum': df.groupby('NAME')['col1'].sum()})

How do I add the col2_latest column to this dataframe?


Solution

  • You need aggregate by sum and last:

    df_agg = df.groupby('user_id').agg({'col1':'sum','col2':'last'})
    df_agg = df_agg[['col1','col2']].rename(columns={'col1':'col1_sum','col2':'col2_latest'})
                                    .reset_index()
    print (df_agg)
       user_id  col1_sum col2_latest
    0      100         5           B
    1      200         4           B
    2      300         7           C