Tags: python, pandas, dataframe, select, delete-row

How to select specific data in a dataframe and delete all other data?


I have a dataframe with two columns: a group (data type: int64) and a date (data type: object, displayed as "O"). For each group, I would like to keep only the rows with the oldest date, i.e. all rows up to the point where a newer date appears. All other rows in each group must be deleted.

DataFrame example:

import pandas as pd 
data = {'Col1':[1, 1, 1, 1, 1, 1, 1, 2, 2, 2 ], 'Col2':["2018-01-31", "2018-01-31", "2018-01-31", "2018-01-31", "2019-01-31", "2019-01-31", "2020-01-31", "2018-03-04", "2019-03-04", "2020-03-04"]}
df = pd.DataFrame(data)
print(df)

Desired result:

data = {'Col1':[1, 1, 1, 1, 2 ], 'Col2':["2018-01-31", "2018-01-31", "2018-01-31", "2018-01-31", "2018-03-04"]}
df = pd.DataFrame(data)
print(df)

I really hope somebody knows how to do this. Thanks in advance! :)


Solution

  • Convert to datetime, group by and rank:

    df['Col2'] = pd.to_datetime(df['Col2'])
    df['rank'] = df.groupby('Col1')['Col2'].rank(ascending=True,method='min')
    df
       Col1       Col2  rank
    0     1 2018-01-31   1.0
    1     1 2018-01-31   1.0
    2     1 2018-01-31   1.0
    3     1 2018-01-31   1.0
    4     1 2019-01-31   5.0
    5     1 2019-01-31   5.0
    6     1 2020-01-31   7.0
    7     2 2018-03-04   1.0
    8     2 2019-03-04   2.0
    9     2 2020-03-04   3.0
    

    Rows sharing a group's earliest date all get rank 1 (because of method='min'), so keep only the rows with a rank of 1:

    df = df.loc[df['rank']==1]
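
As a possible variant (not the method from the answer above), the same filter can be sketched without a helper column by comparing each row's date to the earliest date in its group, using `groupby(...).transform('min')`. The names `earliest` and `result` are illustrative, and the final `reset_index(drop=True)` is an assumption about wanting a clean 0..n-1 index as in the desired result.

```python
import pandas as pd

data = {
    'Col1': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2],
    'Col2': ["2018-01-31", "2018-01-31", "2018-01-31", "2018-01-31",
             "2019-01-31", "2019-01-31", "2020-01-31",
             "2018-03-04", "2019-03-04", "2020-03-04"],
}
df = pd.DataFrame(data)

# Parse the strings so that comparisons are done on real dates
df['Col2'] = pd.to_datetime(df['Col2'])

# Earliest date per group, broadcast back onto every row of that group
earliest = df.groupby('Col1')['Col2'].transform('min')

# Keep only the rows whose date equals their group's earliest date
result = df.loc[df['Col2'] == earliest].reset_index(drop=True)
print(result)
```

With the example data this should leave the four 2018-01-31 rows of group 1 and the single 2018-03-04 row of group 2, and no extra `rank` column has to be dropped afterwards.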