I have a dataset that looks like this:
id | code | date |
---|---|---|
1 | 39 | 20180527 |
1 | 17 | 20180223 |
1 | 17 | 20180223 |
1 | 17 | 20180223 |
1 | 30 | 20120612 |
1 | 14 | 20120214 |
2 | 40 | 20210605 |
2 | 32 | 20210412 |
2 | 25 | 20210315 |
3 | 39 | 20170504 |
3 | 17 | 20170205 |
3 | 40 | 20150506 |
As you can see, the highest codes have the most recent dates. I want to extract those codes and dates. Some of them repeat. For instance, in id 1 the two highest codes are 39 and 30, and I want those codes together with the dates associated with them. One thing to notice here is that both codes should only be kept when there is at least a one-year gap between their dates; when the two dates fall within a year of each other, only the higher code should be kept (which is why id 2 keeps only code 40 below). The output from the above dataset should be:
id | code | date |
---|---|---|
1 | 39 | 20180527 |
1 | 30 | 20120612 |
2 | 40 | 20210605 |
3 | 39 | 20170504 |
3 | 40 | 20150506 |
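For reference, the dataset above can be reconstructed as a DataFrame (named `df`, matching the snippets below; dates are plain `YYYYMMDD` integers):

```python
import pandas as pd

df = pd.DataFrame({
    'id':   [1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
    'code': [39, 17, 17, 17, 30, 14, 40, 32, 25, 39, 17, 40],
    'date': [20180527, 20180223, 20180223, 20180223, 20120612, 20120214,
             20210605, 20210412, 20210315, 20170504, 20170205, 20150506],
})
```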
I tried to use the following code, but it gives me only the single highest value per id and does not keep the second code.
```python
latest_dates = column_selection.groupby("id").max()  # group by id and take the max of each column per group
latest_dates = latest_dates.reset_index()            # reset the index
latest_dates                                         # show the result for each id
```
This is the output I get using the above code:
id | code | date |
---|---|---|
1 | 39 | 20180527 |
2 | 40 | 20210605 |
3 | 39 | 20170504 |
I would greatly appreciate any help.
Use `.nlargest(2)` and `.loc` to filter your index values.

```python
df.loc[df.groupby(['id'])['code'].nlargest(2).index.get_level_values(1)]
```
```
    id  code      date
0    1    39  20180527
4    1    30  20120612
6    2    40  20210605
7    2    32  20210412
11   3    40  20150506
9    3    39  20170504
```
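This works because the grouped `nlargest` returns a Series whose MultiIndex pairs each `id` with the original row label; level 1 of that index is exactly the set of row labels to pass to `.loc`. A quick way to see it (using the `df` built above):

```python
top2 = df.groupby(['id'])['code'].nlargest(2)
print(top2.index.get_level_values(1).tolist())  # -> [0, 4, 6, 7, 11, 9]
```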
Another method would be to use `.sort_values` and `.cumcount` with a boolean filter.

```python
df.loc[df.sort_values('code', ascending=False).groupby('id').cumcount().le(1)]
```
```
    id  code      date
0    1    39  20180527
4    1    30  20120612
6    2    40  20210605
7    2    32  20210412
9    3    39  20170504
11   3    40  20150506
```
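For clarity, `cumcount()` numbers the rows of each `id` group 0, 1, 2, … in descending-`code` order after the sort, so `.le(1)` marks the top two rows; pandas aligns the boolean mask back to `df` by index. A small sketch of the intermediate steps:

```python
rank = df.sort_values('code', ascending=False).groupby('id').cumcount()
mask = rank.le(1)       # True for the two highest codes in each id
top_two = df.loc[mask]  # same rows as the nlargest approach
```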
After you've got your dataframe, we can turn the `date` field into a proper datetime and apply a `diff` at the `id` level after sorting by `date`. You can then create a secondary key to de-dupe values that have less than a year's difference between them.
```python
import pandas as pd
import numpy as np

df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')
# Top two codes per id, selected by their original row positions.
s = df.loc[df.groupby(['id'])['code'].nlargest(2).index.get_level_values(1)]
# Gap in years to the previous date within each id (first row filled with 1).
con = (s.sort_values(['id', 'date'])
        .groupby('id')['date'].diff() / np.timedelta64(1, 'Y')).fillna(1)
# A close pair gets one flagged and one unflagged row, both with cumcount 0,
# so they collide on ['id', 'grp']; a far pair gets 0 and 1 and both survive.
s['grp'] = s.groupby(['id', con.lt(1)]).cumcount()
# Of each colliding pair, keep the row with the higher code.
out = s.sort_values(['id', 'code']).drop_duplicates(subset=['id', 'grp'], keep='last').drop('grp', axis=1)
```
```
   id  code       date
4   1    30 2012-06-12
0   1    39 2018-05-27
6   2    40 2021-06-05
9   3    39 2017-05-04
11  3    40 2015-05-06
```
Diving a little deeper, we need two conditions to filter by: any values that have less than a year's delta between them, and, of those values, take the higher of the two.
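Putting both conditions together, here is an equivalent sketch (starting again from `df` with the raw `YYYYMMDD` integers); it compares the gap against a 365-day `Timedelta` rather than dividing by `np.timedelta64(1, 'Y')`, and the helper names (`pair_is_close`, `top_code`) are my own:

```python
import pandas as pd

# Top two codes per id, selected by the row positions the grouped nlargest reports.
s = df.loc[df.groupby('id')['code'].nlargest(2).index.get_level_values(1)].copy()
s['date'] = pd.to_datetime(s['date'], format='%Y%m%d')
s = s.sort_values(['id', 'date'])

# Condition 1: flag ids whose two dates are less than a year (365 days) apart.
gap = s.groupby('id')['date'].diff()
pair_is_close = gap.lt(pd.Timedelta(days=365)).groupby(s['id']).transform('any')

# Condition 2: where the pair is close, keep only the higher of the two codes.
top_code = s.groupby('id')['code'].transform('max')
out = s[~pair_is_close | s['code'].eq(top_code)]
```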