Search code examples
pythonpandasaggregateaggregate-functions

How to aggregate DataFrame to stay rows with the highest date and add new column in Python Pandas?


I have DataFrame in Python Pandas like below ("date_col" is in "datetime64" format):

ID  | date_col   | purchase
----|------------|-------
111 | 2019-01-05 | apple
111 | 2019-05-22 | onion
222 | 2020-11-04 | banana
333 | 2020-04-19 | orange

I need to aggregate above table in the following way:

  1. add column "col1" with number of purchases which was made by client ("ID")
  2. If some client ("ID") is duplicated - stay only one row with the highest date

So as a result I need something like below:

ID  | date_col   | purchase | col1
----|------------|----------|-----
111 | 2019-05-22 | onion    | 2
222 | 2020-11-04 | banana   | 1
333 | 2020-04-19 | orange   | 1

Solution

  • Assuming the dataframe is sorted on date_col column, you can use groupby:

    g = df.groupby('ID', as_index=False)
    g.last().merge(g.size())
    

        ID    date_col purchase  size
    0  111  2019-05-22    onion     2
    1  222  2020-11-04   banana     1
    2  333  2020-04-19   orange     1