
Obtain the highest codes, repeating and non-repeating, from the dataset


I have a dataset that looks like this:

id  code  date
1   39    20180527
1   17    20180223
1   17    20180223
1   17    20180223
1   30    20120612
1   14    20120214
2   40    20210605
2   32    20210412
2   25    20210315
3   39    20170504
3   17    20170205
3   40    20150506
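
For reference, the sample data can be reproduced as a DataFrame like this (a sketch; the code and date columns are assumed to be stored as plain integers, as shown above):

import pandas as pd

df = pd.DataFrame({
    'id':   [1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
    'code': [39, 17, 17, 17, 30, 14, 40, 32, 25, 39, 17, 40],
    'date': [20180527, 20180223, 20180223, 20180223, 20120612, 20120214,
             20210605, 20210412, 20210315, 20170504, 20170205, 20150506],
})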

As you can see, the highest codes generally have the most recent dates. I want to extract those codes and their dates. Some of them repeat. For instance, in id 1 the highest codes are 39 and 30, and I want those codes together with the dates associated with them. One thing to notice here is that the highest codes I want to keep have at least a one-year gap between their dates. The output from the above dataset should be:

id  code  date
1   39    20180527
1   30    20120612
2   40    20210605
3   39    20170504
3   40    20150506

I tried the following code, but it gives me only the single highest value per id and does not take the repeating values into account.

latest_dates = column_selection.groupby("id").max() # group the data by id and get the max date for each group

latest_dates = latest_dates.reset_index() # reset the index

latest_dates # print the latest date for each ID with the new index

This is the output I get using the above code:

id  code  date
1   39    20180527
2   40    20210605
3   39    20170504

I would greatly appreciate any help.


Solution

  • Use .nlargest(2) and .loc to filter by your index values.

    df.loc[df.groupby(['id'])['code'].nlargest(2).index.get_level_values(1)]
    
        id  code      date
    0    1    39  20180527
    4    1    30  20120612
    6    2    40  20210605
    7    2    32  20210412
    11   3    40  20150506
    9    3    39  20170504
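
    To see why .index.get_level_values(1) is needed: the groupby/nlargest call returns a Series indexed by (id, original row label), and level 1 holds the row labels that get handed to .loc. Roughly, the intermediate result on the sample data looks like this (a sketch; exact display formatting may differ):

    df.groupby(['id'])['code'].nlargest(2)
    
        id
        1   0     39
            4     30
        2   6     40
            7     32
        3   11    40
            9     39
        Name: code, dtype: int64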
    

    Another method would be to use .sort_values and .cumcount with a boolean filter.

     df.loc[df.sort_values('code',ascending=False).groupby('id').cumcount().le(1)]
    
    
    
        id  code      date
    0    1    39  20180527
    4    1    30  20120612
    6    2    40  20210605
    7    2    32  20210412
    9    3    39  20170504
    11   3    40  20150506
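
    Broken into steps, the same mask looks like this (split out purely for readability; ranks and mask are just illustrative names):

    # Position of each row within its id, counting down from the highest code
    # (0 = highest, 1 = second highest). cumcount keeps the original row
    # labels, so the boolean mask aligns back onto df inside .loc[...].
    ranks = df.sort_values('code', ascending=False).groupby('id').cumcount()
    mask = ranks.le(1)
    df.loc[mask]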
    

    Edit - handling year differences.

    After you've got your dataframe, we can turn the date field into a proper datetime and take a diff at the id level after sorting by date.

    You can then create a secondary key to de-dupe the values that have less than a year's difference between them.

    import numpy as np
    import pandas as pd
    
    # Parse the yyyymmdd integers into proper datetimes.
    df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')
    
    # Top two codes per id, as before.
    s = df.loc[df.groupby(['id'])['code'].nlargest(2).index.get_level_values(1)]
    
    # Gap in years between consecutive dates within each id (NaN for the first
    # row of each id, filled with 1 so it counts as "a year or more apart").
    con = (s.sort_values(['id', 'date'])
            .groupby('id')['date'].diff() / np.timedelta64(1, 'Y')).fillna(1)
    
    # Secondary key: a row that is less than a year after the previous date of
    # its id lands in its own cumcount group, so it collides on the same
    # (id, grp) pair as the other row and can be de-duped below.
    s['grp'] = s.groupby(['id', con.lt(1)]).cumcount()
    
    # Sort so the higher code comes last, then keep one row per (id, grp).
    out = (s.sort_values(['id', 'code'])
            .drop_duplicates(subset=['id', 'grp'], keep='last')
            .drop('grp', axis=1))
    
    
        id  code       date
    4    1    30 2012-06-12
    0    1    39 2018-05-27
    6    2    40 2021-06-05
    9    3    39 2017-05-04
    11   3    40 2015-05-06
    

    Diving a little deeper, we need two conditions to filter by:

    1. Find any values that have less than a year's delta between them.

    2. Of those values, take the higher of the two.
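
    As a minimal sketch of those two conditions applied directly (assuming df['date'] has already been parsed with pd.to_datetime as above; top2, gap_days and is_max_code are illustrative names, and the 365-day threshold stands in for "one year"):

    # Top two codes per id, as before.
    top2 = df.loc[df.groupby('id')['code'].nlargest(2).index.get_level_values(1)]
    
    # 1. Gap in days between the two selected dates of each id.
    gap_days = top2.groupby('id')['date'].transform(lambda d: d.max() - d.min()).dt.days
    
    # 2. Drop a row only when its id's pair is less than a year apart and the
    #    row does not hold the highest code of that id.
    is_max_code = top2['code'].eq(top2.groupby('id')['code'].transform('max'))
    out = top2[(gap_days >= 365) | is_max_code]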