Search code examples
pythonpandasgroup-byunique

pandas groupby: selecting unique latest entries


In the following pandas Data Frame:

    Name    v   date_modified
0   A   0   2023-01-01
1   A   1   2023-01-02
2   A   2   2023-01-03
3   B   0   2023-01-30
4   B   1   2023-01-02
5   B   2   2023-01-03
6   C   0   2023-01-30
7   C   1   2023-01-03
8   C   2   2023-01-03

How can I get two latest versions with most recent unique date_modified per group ['Name', 'v']?

In this example there are duplicates date_modified on df.Name == C. So far I tired to do something like this: df.sort_values('date_modified').groupby(['Name', 'v']).tail(2). This does not omit duplicates on date_modified and also for some reason return all rows not just tail of two


Solution

  • IIUC, you have to drop some duplicates before:

    >>> (df.drop_duplicates(['Name', 'date_modified'], keep='first')
           .sort_values('date_modified').groupby('Name').tail(2).sort_index())
    
      Name  v date_modified
    1    A  1    2023-01-02
    2    A  2    2023-01-03
    3    B  0    2023-01-30
    5    B  2    2023-01-03
    6    C  0    2023-01-30
    7    C  1    2023-01-03