Search code examples
python-3.xpandasdataframedata-analysisdata-processing

Filter rows of 1st Dataframe from the 2nd Dataframe having different starting dates


I have two dataframes from which a new dataframe has to be created. The first one is given below.

data = {'ID':['A', 'A', 'A', 'A', 'A', 'B','B','B','B', 'C','C','C','C','C','C', 'D','D','D'],
    'Date':['2021-2-13', '2021-2-14', '2021-2-15', '2021-2-16', '2021-2-17', '2021-2-16', '2021-2-17', '2021-2-18', '2021-2-19',
           '2021-2-12', '2021-2-13', '2021-2-14', '2021-2-15', '2021-2-16','2021-2-17', '2021-2-14', '2021-2-15', '2021-2-16'],
   'Steps': [1000, 1200, 1500, 2000, 1400, 4000,3400, 5000,1000, 3500,4000,5000,5300,2000,3500, 5000,5500,5200 ]}
df1 = pd.DataFrame(data)

df1

The image of this is also attached.

DataFrame 1

The 2nd dataframe contains the starting date of each participant as given and shown below.

data1 = {'ID':['A', 'B', 'C', 'D'],
    'Date':['2021-2-15', '2021-2-17', '2021-2-16', '2021-2-15']}
df2 = pd.DataFrame(data1)

df2

The snippet of it is given below.

enter image description here

Now, the resulting dataframe have to be such that for each participant in the Dataframe1, the rows have to start from the dates given in the 2nd Dataframe. The rows prior to that starting date have to be deleted.

The final dataframe as in how it should look is given below.

enter image description here

Any help is greatly appreciated. Thanks


Solution

  • You can use .merge + boolean-indexing:

    df1["Date"] = pd.to_datetime(df1["Date"])
    df2["Date"] = pd.to_datetime(df2["Date"])
    
    x = df1.merge(df2, on="ID", suffixes=("", "_y"))
    print(x.loc[x.Date >= x.Date_y, df1.columns].reset_index(drop=True))
    

    Prints:

      ID       Date  Steps
    0  A 2021-02-15   1500
    1  A 2021-02-16   2000
    2  A 2021-02-17   1400
    3  B 2021-02-17   3400
    4  B 2021-02-18   5000
    5  B 2021-02-19   1000
    6  C 2021-02-16   2000
    7  C 2021-02-17   3500
    8  D 2021-02-15   5500
    9  D 2021-02-16   5200
    

    Or: If some ID is missing in df2:

    x = df1.merge(df2, on="ID", suffixes=("", "_y"), how="outer").fillna(pd.Timestamp(0))
    print(x.loc[x.Date >= x.Date_y, df1.columns].reset_index(drop=True))