Search code examples
pythondatejoinmerge

Merging two dataframes based on dates that are overlapped using python


I have two dataframes, one that designates one characteristic and another that designates another characteristic, I wanted to join them, but in such a way that the result would be dependent of the intersection between dates.

df1:

Key in_date out_date char1
0 1000 01/01/2020 01/10/2020 A
1 1000 02/10/2020 10/12/2020 B
2 2000 01/01/2019 10/01/2019 C
3 2000 11/01/2019 01/10/2022 D
4 2000 02/10/2022 01/01/2023 B
5 2000 02/01/2023 31/12/2030 L

df2

Key st_date end_date char2
0 1000 01/05/2019 01/09/2020 G
1 1000 02/09/2020 10/11/2020 GG
2 2000 20/01/2019 15/02/2019 K
3 2000 16/02/2019 10/01/2022 GE
4 2000 11/01/2022 31/12/2030 GG

Desire result:

key start end char1 char2
0 1000 01/05/2019 31/12/2019 null G
1 1000 01/01/2020 31/08/2020 A G
2 1000 01/09/2020 01/10/2020 A GG
3 1000 02/10/2020 10/11/2020 A GG
4 1000 11/11/2020 10/12/2020 B nan
5 2000 01/01/2019 10/01/2019 C nan
6 2000 11/01/2019 19/01/2019 D nan
7 2000 20/01/2019 15/02/2019 D K
8 2000 16/02/2019 10/01/2022 D GE
9 2000 11/01/2022 01/10/2022 D GG
10 2000 02/10/2022 01/01/2023 B GG
11 2000 02/01/2023 31/12/2030 L GG

I tried to use many if and else, but when I tried to aggregate the dataframe, didn't work.

I tried to use pd.merge but I have a sparse matrix


Solution

  • I have a similar poblem in my line of work. The easiest way to tackle this problem is to synchronize both dataframes to ensure that all periods are the same. That means that if there is a break between two entries in one dataframe, you need to introduce it to the other as well.

    This will lead to two dataframes that have the same values for in_time and out_time. Here is an example for a synchronizing function. The dataframes are transformed to dictionaries of the form:

    df1 = {'Key': ['1000', '1000', '2000', '2000', '2000', '2000'], 'in_date': ['01/01/2020', '02/10/2020', '01/01/2019', '11/01/2019', '02/10/2022', '02/01/2023'], 'out_date': ['01/10/2020', '10/12/2020', '10/01/2019', '01/10/2022', '01/01/2023', '31/12/2030'], 'char1': ['A', 'B', 'C', 'D', 'B', 'L']}
    
    df2 = {'Key': ['1000', '1000', '2000', '2000', '2000'], 'in_date': ['01/05/2019', '02/09/2020', '20/01/2019', '16/02/2019', '11/01/2022'], 'out_date': ['01/09/2020', '10/11/2020', '15/02/2019', '10/01/2022', '31/12/2030'], 'char2': ['G', 'GG', 'K', 'GE', 'GG']}
    

    The function is:

    def synchronize(df1,df2):
        cuts1 = {}
        cuts2 = {}
    
        ## Define the cutpoints for each Key
    
        for i,key in enumerate(df1["Key"]):
            cuts = [datetime.strptime(df1["in_date"][i],"%d/%m/%Y"),datetime.strptime(df1["out_date"][i],"%d/%m/%Y")]
            try:
                cuts1[key]+=cuts
            except:
                cuts1[key]=cuts
            cuts1[key]=sorted(cuts1[key])
    
        for i,key in enumerate(df2["Key"]):
            cuts = [datetime.strptime(df2["in_date"][i],"%d/%m/%Y"),datetime.strptime(df2["out_date"][i],"%d/%m/%Y")]
            try:
                cuts2[key]+=cuts
            except:
                cuts2[key]=cuts
            cuts2[key]=sorted(cuts2[key])
    
        ## Now we have two lists of sorted cutpoints to split periods in the other data frame
        ## Extend both data frames to adhere to these cuts:
    
        outdf1 = {}
        outdf2 = {}
        for v in df1.keys(): outdf1[v] = []
        for v in df2.keys(): outdf2[v] = []
    
        for i,key in enumerate(df1["Key"]):
            st,et = datetime.strptime(df1["in_date"][i],"%d/%m/%Y"),datetime.strptime(df1["out_date"][i],"%d/%m/%Y")
            rcuts = []
            for c in cuts2[key]:
                if c>st and c<et: ## Cut within this session
                    rcuts.append(c)
    
            if len(rcuts)==0:
                periods = [(st,et)]
            else:
                periods = [(st,rcuts[0])]
                for ci in range(len(rcuts)-1):
                    periods.append((rcuts[ci],rcuts[ci+1]))
                periods.append((rcuts[-1],et))
    
            for p in periods:
                outdf1["Key"].append(key)
                outdf1["in_date"].append(p[0].strftime("%d/%m/%Y"))
                outdf1["out_date"].append(p[1].strftime("%d/%m/%Y"))
                outdf1["char1"].append(df1["char1"][i])
    
        for i,key in enumerate(df2["Key"]):
            st,et = datetime.strptime(df2["in_date"][i],"%d/%m/%Y"),datetime.strptime(df2["out_date"][i],"%d/%m/%Y")
            rcuts = []
            for c in cuts1[key]:
                if c>st and c<et: ## Cut within this session
                    rcuts.append(c)
    
            if len(rcuts)==0:
                periods = [(st,et)]
            else:
                periods = [(st,rcuts[0])]
                ci=0
                for ci in range(len(rcuts)-1):
                    periods.append((rcuts[ci],rcuts[ci+1]))
                periods.append((rcuts[-1],et))
    
            for p in periods:
                outdf2["Key"].append(key)
                outdf2["in_date"].append(p[0].strftime("%d/%m/%Y"))
                outdf2["out_date"].append(p[1].strftime("%d/%m/%Y"))
                outdf2["char2"].append(df2["char2"][i])
    
        return outdf1,outdf2
    

    Calling this function as: new_df1,new_df2 = synchronize(df1,df2) will result in two new dataframes with the following contents:

    #new_df1
    Key     in_date     out_date    char1
    1000    01/01/2020  01/09/2020  A
    1000    01/09/2020  02/09/2020  A
    1000    02/09/2020  01/10/2020  A
    1000    02/10/2020  10/11/2020  B
    1000    10/11/2020  10/12/2020  B
    2000    01/01/2019  10/01/2019  C
    2000    11/01/2019  20/01/2019  D
    2000    20/01/2019  15/02/2019  D
    2000    15/02/2019  16/02/2019  D
    2000    16/02/2019  10/01/2022  D
    2000    10/01/2022  11/01/2022  D
    2000    11/01/2022  01/10/2022  D
    2000    02/10/2022  01/01/2023  B
    2000    02/01/2023  31/12/2030  L
    
    ## new_df2
    Key     in_date     out_date    char2
    1000    01/05/2019  01/01/2020  G
    1000    01/01/2020  01/09/2020  G
    1000    02/09/2020  01/10/2020  GG
    1000    01/10/2020  02/10/2020  GG
    1000    02/10/2020  10/11/2020  GG
    2000    20/01/2019  15/02/2019  K
    2000    16/02/2019  10/01/2022  GE
    2000    11/01/2022  01/10/2022  GG
    2000    01/10/2022  02/10/2022  GG
    2000    02/10/2022  01/01/2023  GG
    2000    01/01/2023  02/01/2023  GG
    2000    02/01/2023  31/12/2030  GG
    

    Now, you can merge the two dataframes using the Key, in_date and out_date as keys.

    Note that there are 1-day periods where nothing was defined. For example, the period for Key 1000 and char A in dataframe1 is interrupted by the end of a period on 1.9.2020 and a beginning of another one on 2.9.2020.

    After merging, you may need to clean up the dates, subtracting one day from out_time timestamps where they are the same as the next in_time timestamp and removing the now existing 0-day periods. But you can only do this after merging, as the dates would not match up, otherwise.