python pandas dataframe select delete-row

How to select specific data in a dataframe and delete all other data?

I have a dataframe with two columns: one representing a group (data type: int64) and one a date (data type: "O", i.e. object). For each group, I would like to keep only the rows with the oldest date, i.e. the rows up to the point where a newer date appears. All other rows in each group must be deleted.

DataFrame example:

import pandas as pd 
data = {'Col1':[1, 1, 1, 1, 1, 1, 1, 2, 2, 2 ], 'Col2':["2018-01-31", "2018-01-31", "2018-01-31", "2018-01-31", "2019-01-31", "2019-01-31", "2020-01-31", "2018-03-04", "2019-03-04", "2020-03-04"]}
df = pd.DataFrame(data)
print(df)

Desired result:

data = {'Col1':[1, 1, 1, 1, 2 ], 'Col2':["2018-01-31", "2018-01-31", "2018-01-31", "2018-01-31", "2018-03-04"]}
df = pd.DataFrame(data)
print(df)

I really hope somebody knows how to do this. Thanks in advance! :)

Solution

Convert Col2 to datetime, then rank the dates within each Col1 group:

df['Col2'] = pd.to_datetime(df['Col2'])
df['rank'] = df.groupby('Col1')['Col2'].rank(ascending=True,method='min')
df
   Col1       Col2  rank
0     1 2018-01-31   1.0
1     1 2018-01-31   1.0
2     1 2018-01-31   1.0
3     1 2018-01-31   1.0
4     1 2019-01-31   5.0
5     1 2019-01-31   5.0
6     1 2020-01-31   7.0
7     2 2018-03-04   1.0
8     2 2019-03-04   2.0
9     2 2020-03-04   3.0
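
The choice of method='min' matters when the oldest date occurs more than once in a group. As a small sketch on a made-up three-row Series (the names dates, rank_min and rank_avg are illustrative, not part of the data above), compare it with the default method='average':

```python
import pandas as pd

# Illustrative data: the oldest date appears twice
dates = pd.Series(pd.to_datetime(["2018-01-31", "2018-01-31", "2019-01-31"]))

# method='min': tied values all receive the lowest rank of the tie
rank_min = dates.rank(method='min')

# default method='average': tied values receive the mean of their ranks
rank_avg = dates.rank()

print(rank_min.tolist())  # [1.0, 1.0, 3.0]
print(rank_avg.tolist())  # [1.5, 1.5, 3.0]
```

With the default, the tied oldest rows would get rank 1.5, so a filter on rank 1 would not match them.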

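Then keep only the rows with a rank of 1 and drop the helper column, so the result matches the desired output:

df = df.loc[df['rank'] == 1].drop(columns='rank')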
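
If you would rather not create a helper column at all, one possible alternative is to compare each row's date with the earliest date of its group, using groupby(...).transform('min'). This is only a sketch built on the example data from the question; the names earliest and result are my own, and the reset_index call is optional.

```python
import pandas as pd

data = {'Col1': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2],
        'Col2': ["2018-01-31", "2018-01-31", "2018-01-31", "2018-01-31",
                 "2019-01-31", "2019-01-31", "2020-01-31",
                 "2018-03-04", "2019-03-04", "2020-03-04"]}
df = pd.DataFrame(data)
df['Col2'] = pd.to_datetime(df['Col2'])

# Earliest date of each group, broadcast back to every row of that group
earliest = df.groupby('Col1')['Col2'].transform('min')

# Keep only the rows whose date equals their group's earliest date
result = df.loc[df['Col2'] == earliest].reset_index(drop=True)
print(result)
```

Because transform returns a Series aligned with the original index, the comparison works row by row and no intermediate column has to be removed afterwards.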