I have a dataframe with two columns: a group (data type: int64) and a date (data type: "O", i.e. object). For each group, I would like to keep only the rows with the oldest date, that is, the rows up to the point where a new date appears. All other rows of the group must be deleted.
DataFrame example:
import pandas as pd
data = {'Col1':[1, 1, 1, 1, 1, 1, 1, 2, 2, 2 ], 'Col2':["2018-01-31", "2018-01-31", "2018-01-31", "2018-01-31", "2019-01-31", "2019-01-31", "2020-01-31", "2018-03-04", "2019-03-04", "2020-03-04"]}
df = pd.DataFrame(data)
print(df)
Desired result:
data = {'Col1':[1, 1, 1, 1, 2 ], 'Col2':["2018-01-31", "2018-01-31", "2018-01-31", "2018-01-31", "2018-03-04"]}
df = pd.DataFrame(data)
print(df)
I really hope somebody knows how to do this. Thanks in advance! :)
Convert to datetime, group by and rank:
df['Col2'] = pd.to_datetime(df['Col2'])
df['rank'] = df.groupby('Col1')['Col2'].rank(ascending=True,method='min')
df
   Col1       Col2  rank
0     1 2018-01-31   1.0
1     1 2018-01-31   1.0
2     1 2018-01-31   1.0
3     1 2018-01-31   1.0
4     1 2019-01-31   5.0
5     1 2019-01-31   5.0
6     1 2020-01-31   7.0
7     2 2018-03-04   1.0
8     2 2019-03-04   2.0
9     2 2020-03-04   3.0
With method='min', tied dates share the lowest rank, so every row holding a group's oldest date gets rank 1. Keep only those rows:
df = df.loc[df['rank']==1]
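If you would rather not add a helper column, the same filter can probably be written with a group-wise minimum instead of a rank. This is only a sketch of an alternative, not the approach above: it rebuilds the example frame from the question, and the names `earliest` and `result` are made up for illustration.

```python
import pandas as pd

# Example data from the question
data = {
    'Col1': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2],
    'Col2': ["2018-01-31", "2018-01-31", "2018-01-31", "2018-01-31",
             "2019-01-31", "2019-01-31", "2020-01-31",
             "2018-03-04", "2019-03-04", "2020-03-04"],
}
df = pd.DataFrame(data)
df['Col2'] = pd.to_datetime(df['Col2'])

# Oldest date of each group, broadcast back onto every row of that group
earliest = df.groupby('Col1')['Col2'].transform('min')

# Keep only the rows whose date equals their group's oldest date
result = df.loc[df['Col2'] == earliest].reset_index(drop=True)
print(result)
```

Both versions keep all tied rows for the oldest date; this one just skips the extra 'rank' column, so there is nothing to drop afterwards.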