I have a dataset that looks like this:
id | code | date |
---|---|---|
1 | 39 | 20180527 |
1 | 17 | 20180223 |
1 | 17 | 20180223 |
1 | 17 | 20180223 |
1 | 30 | 20120612 |
1 | 14 | 20120214 |
2 | 40 | 20210605 |
2 | 32 | 20210412 |
2 | 25 | 20210315 |
3 | 39 | 20170504 |
3 | 17 | 20170205 |
3 | 40 | 20150506 |
As you can see, the highest codes have the most recent dates. I want to extract those codes and dates. Some of them repeat. For instance, in id 1 the two highest codes are 39 and 30, and I want those codes together with the dates associated with them. One thing to notice here is that both codes should only be kept when there is at least a one-year gap between their dates; when the two dates fall within a year of each other, only the higher code should be kept (which is why id 2 keeps only code 40 below). The output from the above dataset should be:
id | code | date |
---|---|---|
1 | 39 | 20180527 |
1 | 30 | 20120612 |
2 | 40 | 20210605 |
3 | 39 | 20170504 |
3 | 40 | 20150506 |
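For reference, the dataset above can be reconstructed as a DataFrame (named `df`, matching the snippets below; dates are plain `YYYYMMDD` integers):

```python
import pandas as pd

df = pd.DataFrame({
    'id':   [1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
    'code': [39, 17, 17, 17, 30, 14, 40, 32, 25, 39, 17, 40],
    'date': [20180527, 20180223, 20180223, 20180223, 20120612, 20120214,
             20210605, 20210412, 20210315, 20170504, 20170205, 20150506],
})
```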
I tried to use the following code, but it gives me only the single highest value per id and does not keep the second code.
```python
latest_dates = column_selection.groupby("id").max()  # group by id and take the max of each column per group
latest_dates = latest_dates.reset_index()            # reset the index
latest_dates                                         # show the result for each id
```
This is the output I get using the above code:
id | code | date |
---|---|---|
1 | 39 | 20180527 |
2 | 40 | 20210605 |
3 | 39 | 20170504 |
I would greatly appreciate any help.
Use `.nlargest(2)` and `.loc` to filter your index values.

```python
df.loc[df.groupby(['id'])['code'].nlargest(2).index.get_level_values(1)]
```
```
    id  code      date
0    1    39  20180527
4    1    30  20120612
6    2    40  20210605
7    2    32  20210412
11   3    40  20150506
9    3    39  20170504
```
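This works because the grouped `nlargest` returns a Series whose MultiIndex pairs each `id` with the original row label; level 1 of that index is exactly the set of row labels to pass to `.loc`. A quick way to see it (using the `df` built above):

```python
top2 = df.groupby(['id'])['code'].nlargest(2)
print(top2.index.get_level_values(1).tolist())  # -> [0, 4, 6, 7, 11, 9]
```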
Another method would be to use `.sort_values` and `.cumcount` with a boolean filter.

```python
df.loc[df.sort_values('code', ascending=False).groupby('id').cumcount().le(1)]
```
```
    id  code      date
0    1    39  20180527
4    1    30  20120612
6    2    40  20210605
7    2    32  20210412
9    3    39  20170504
11   3    40  20150506
```
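For clarity, `cumcount()` numbers the rows of each `id` group 0, 1, 2, … in descending-`code` order after the sort, so `.le(1)` marks the top two rows; pandas aligns the boolean mask back to `df` by index. A small sketch of the intermediate steps:

```python
rank = df.sort_values('code', ascending=False).groupby('id').cumcount()
mask = rank.le(1)       # True for the two highest codes in each id
top_two = df.loc[mask]  # same rows as the nlargest approach
```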
After you've got your dataframe, we can turn the `date` field into a proper datetime and apply a `diff` at the `id` level after sorting by `date`. You can then create a secondary key to de-dupe values that have less than a year's difference between them.
```python
import pandas as pd
import numpy as np

df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')
# Top two codes per id, selected by their original row positions.
s = df.loc[df.groupby(['id'])['code'].nlargest(2).index.get_level_values(1)]
# Gap in years to the previous date within each id (first row filled with 1).
con = (s.sort_values(['id', 'date'])
        .groupby('id')['date'].diff() / np.timedelta64(1, 'Y')).fillna(1)
# A close pair gets one flagged and one unflagged row, both with cumcount 0,
# so they collide on ['id', 'grp']; a far pair gets 0 and 1 and both survive.
s['grp'] = s.groupby(['id', con.lt(1)]).cumcount()
# Of each colliding pair, keep the row with the higher code.
out = s.sort_values(['id', 'code']).drop_duplicates(subset=['id', 'grp'], keep='last').drop('grp', axis=1)
```
```
   id  code       date
4   1    30 2012-06-12
0   1    39 2018-05-27
6   2    40 2021-06-05
9   3    39 2017-05-04
11  3    40 2015-05-06
```
Diving a little deeper, we need two conditions to filter by: any values that have less than a year's delta between them, and, of those values, take the higher of the two.
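Putting both conditions together, here is an equivalent sketch (starting again from `df` with the raw `YYYYMMDD` integers); it compares the gap against a 365-day `Timedelta` rather than dividing by `np.timedelta64(1, 'Y')`, and the helper names (`pair_is_close`, `top_code`) are my own:

```python
import pandas as pd

# Top two codes per id, selected by the row positions the grouped nlargest reports.
s = df.loc[df.groupby('id')['code'].nlargest(2).index.get_level_values(1)].copy()
s['date'] = pd.to_datetime(s['date'], format='%Y%m%d')
s = s.sort_values(['id', 'date'])

# Condition 1: flag ids whose two dates are less than a year (365 days) apart.
gap = s.groupby('id')['date'].diff()
pair_is_close = gap.lt(pd.Timedelta(days=365)).groupby(s['id']).transform('any')

# Condition 2: where the pair is close, keep only the higher of the two codes.
top_code = s.groupby('id')['code'].transform('max')
out = s[~pair_is_close | s['code'].eq(top_code)]
```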