python pandas dataframe duplicates unique

Keep first occurrence row by id and first occurrence when value in column changes


In the following example df, what is the best way to keep:

  • The first row for each id
  • After that, the first row each time Score changes within that id, dropping the repeated rows until the next change

Example df

      date      id   Score
0   2001-09-06  1       3
1   2001-09-07  1       3
2   2001-09-08  1       4
3   2001-09-09  2       6
4   2001-09-10  2       6
5   2001-09-11  1       4
6   2001-09-12  2       5
7   2001-09-13  2       5
8   2001-09-14  1       3

Desired df

      date      id   Score
0   2001-09-06  1       3
1   2001-09-08  1       4
2   2001-09-09  2       6
3   2001-09-12  2       5
4   2001-09-14  1       3

Solution

  • Use groupby with diff:

    print(df[df.groupby("id")["Score"].diff() != 0])
    
             date  id  Score
    0  2001-09-06   1      3
    2  2001-09-08   1      4
    3  2001-09-09   2      6
    6  2001-09-12   2      5
    8  2001-09-14   1      3
    

    The first row of each id has no previous value to compare against, so diff() returns NaN there. Since NaN != 0 evaluates to True, that row is always kept.
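
    For reference, here is a minimal end-to-end sketch of the approach above. It rebuilds the example df by hand (dates are kept as plain strings, which is an assumption; the original does not state the dtype) and adds reset_index(drop=True) so the index runs 0..4 as in the desired df. The shift-based mask at the end is an alternative not taken from the answer above; it should select the same rows and may be useful when Score is not numeric, since diff needs values that can be subtracted.

```python
import pandas as pd

# Example frame from the question (dates as strings: an assumption).
df = pd.DataFrame({
    "date": ["2001-09-06", "2001-09-07", "2001-09-08", "2001-09-09",
             "2001-09-10", "2001-09-11", "2001-09-12", "2001-09-13",
             "2001-09-14"],
    "id": [1, 1, 1, 2, 2, 1, 2, 2, 1],
    "Score": [3, 3, 4, 6, 6, 4, 5, 5, 3],
})

# Difference to the previous Score within the same id.
# The first row of each id yields NaN, and NaN != 0 is True, so it is kept.
changed = df.groupby("id")["Score"].diff().ne(0)

# Keep the flagged rows and renumber the index to match the desired df.
result = df[changed].reset_index(drop=True)
print(result)

# Alternative mask (not from the original answer): compare each Score with
# the previous Score of the same id. This does not require subtraction.
changed_alt = df["Score"].ne(df.groupby("id")["Score"].shift())
result_alt = df[changed_alt].reset_index(drop=True)
```

    Without reset_index, the result keeps the original row labels (0, 2, 3, 6, 8), as in the printed output above.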