Merging two dataframes based on dates that are overlapped using python

I have two dataframes, one that designates one characteristic and another that designates another characteristic, I wanted to join them, but in such a way that the result would be dependent of the intersection between dates.

df1:

	Key	in_date	out_date	char1
0	1000	01/01/2020	01/10/2020	A
1	1000	02/10/2020	10/12/2020	B
2	2000	01/01/2019	10/01/2019	C
3	2000	11/01/2019	01/10/2022	D
4	2000	02/10/2022	01/01/2023	B
5	2000	02/01/2023	31/12/2030	L

df2

	Key	st_date	end_date	char2
0	1000	01/05/2019	01/09/2020	G
1	1000	02/09/2020	10/11/2020	GG
2	2000	20/01/2019	15/02/2019	K
3	2000	16/02/2019	10/01/2022	GE
4	2000	11/01/2022	31/12/2030	GG

Desire result:

	key	start	end	char1	char2
0	1000	01/05/2019	31/12/2019	null	G
1	1000	01/01/2020	31/08/2020	A	G
2	1000	01/09/2020	01/10/2020	A	GG
3	1000	02/10/2020	10/11/2020	A	GG
4	1000	11/11/2020	10/12/2020	B	nan
5	2000	01/01/2019	10/01/2019	C	nan
6	2000	11/01/2019	19/01/2019	D	nan
7	2000	20/01/2019	15/02/2019	D	K
8	2000	16/02/2019	10/01/2022	D	GE
9	2000	11/01/2022	01/10/2022	D	GG
10	2000	02/10/2022	01/01/2023	B	GG
11	2000	02/01/2023	31/12/2030	L	GG

I tried to use many if and else, but when I tried to aggregate the dataframe, didn't work.

I tried to use pd.merge but I have a sparse matrix

Solution

I have a similar poblem in my line of work. The easiest way to tackle this problem is to synchronize both dataframes to ensure that all periods are the same. That means that if there is a break between two entries in one dataframe, you need to introduce it to the other as well.

This will lead to two dataframes that have the same values for in_time and out_time. Here is an example for a synchronizing function. The dataframes are transformed to dictionaries of the form:

df1 = {'Key': ['1000', '1000', '2000', '2000', '2000', '2000'], 'in_date': ['01/01/2020', '02/10/2020', '01/01/2019', '11/01/2019', '02/10/2022', '02/01/2023'], 'out_date': ['01/10/2020', '10/12/2020', '10/01/2019', '01/10/2022', '01/01/2023', '31/12/2030'], 'char1': ['A', 'B', 'C', 'D', 'B', 'L']}

df2 = {'Key': ['1000', '1000', '2000', '2000', '2000'], 'in_date': ['01/05/2019', '02/09/2020', '20/01/2019', '16/02/2019', '11/01/2022'], 'out_date': ['01/09/2020', '10/11/2020', '15/02/2019', '10/01/2022', '31/12/2030'], 'char2': ['G', 'GG', 'K', 'GE', 'GG']}

The function is:

def synchronize(df1,df2):
    cuts1 = {}
    cuts2 = {}

    ## Define the cutpoints for each Key

    for i,key in enumerate(df1["Key"]):
        cuts = [datetime.strptime(df1["in_date"][i],"%d/%m/%Y"),datetime.strptime(df1["out_date"][i],"%d/%m/%Y")]
        try:
            cuts1[key]+=cuts
        except:
            cuts1[key]=cuts
        cuts1[key]=sorted(cuts1[key])

    for i,key in enumerate(df2["Key"]):
        cuts = [datetime.strptime(df2["in_date"][i],"%d/%m/%Y"),datetime.strptime(df2["out_date"][i],"%d/%m/%Y")]
        try:
            cuts2[key]+=cuts
        except:
            cuts2[key]=cuts
        cuts2[key]=sorted(cuts2[key])

    ## Now we have two lists of sorted cutpoints to split periods in the other data frame
    ## Extend both data frames to adhere to these cuts:

    outdf1 = {}
    outdf2 = {}
    for v in df1.keys(): outdf1[v] = []
    for v in df2.keys(): outdf2[v] = []

    for i,key in enumerate(df1["Key"]):
        st,et = datetime.strptime(df1["in_date"][i],"%d/%m/%Y"),datetime.strptime(df1["out_date"][i],"%d/%m/%Y")
        rcuts = []
        for c in cuts2[key]:
            if c>st and c<et: ## Cut within this session
                rcuts.append(c)

        if len(rcuts)==0:
            periods = [(st,et)]
        else:
            periods = [(st,rcuts[0])]
            for ci in range(len(rcuts)-1):
                periods.append((rcuts[ci],rcuts[ci+1]))
            periods.append((rcuts[-1],et))

        for p in periods:
            outdf1["Key"].append(key)
            outdf1["in_date"].append(p[0].strftime("%d/%m/%Y"))
            outdf1["out_date"].append(p[1].strftime("%d/%m/%Y"))
            outdf1["char1"].append(df1["char1"][i])

    for i,key in enumerate(df2["Key"]):
        st,et = datetime.strptime(df2["in_date"][i],"%d/%m/%Y"),datetime.strptime(df2["out_date"][i],"%d/%m/%Y")
        rcuts = []
        for c in cuts1[key]:
            if c>st and c<et: ## Cut within this session
                rcuts.append(c)

        if len(rcuts)==0:
            periods = [(st,et)]
        else:
            periods = [(st,rcuts[0])]
            ci=0
            for ci in range(len(rcuts)-1):
                periods.append((rcuts[ci],rcuts[ci+1]))
            periods.append((rcuts[-1],et))

        for p in periods:
            outdf2["Key"].append(key)
            outdf2["in_date"].append(p[0].strftime("%d/%m/%Y"))
            outdf2["out_date"].append(p[1].strftime("%d/%m/%Y"))
            outdf2["char2"].append(df2["char2"][i])

    return outdf1,outdf2

Calling this function as: new_df1,new_df2 = synchronize(df1,df2) will result in two new dataframes with the following contents:

#new_df1
Key     in_date     out_date    char1
1000    01/01/2020  01/09/2020  A
1000    01/09/2020  02/09/2020  A
1000    02/09/2020  01/10/2020  A
1000    02/10/2020  10/11/2020  B
1000    10/11/2020  10/12/2020  B
2000    01/01/2019  10/01/2019  C
2000    11/01/2019  20/01/2019  D
2000    20/01/2019  15/02/2019  D
2000    15/02/2019  16/02/2019  D
2000    16/02/2019  10/01/2022  D
2000    10/01/2022  11/01/2022  D
2000    11/01/2022  01/10/2022  D
2000    02/10/2022  01/01/2023  B
2000    02/01/2023  31/12/2030  L

## new_df2
Key     in_date     out_date    char2
1000    01/05/2019  01/01/2020  G
1000    01/01/2020  01/09/2020  G
1000    02/09/2020  01/10/2020  GG
1000    01/10/2020  02/10/2020  GG
1000    02/10/2020  10/11/2020  GG
2000    20/01/2019  15/02/2019  K
2000    16/02/2019  10/01/2022  GE
2000    11/01/2022  01/10/2022  GG
2000    01/10/2022  02/10/2022  GG
2000    02/10/2022  01/01/2023  GG
2000    01/01/2023  02/01/2023  GG
2000    02/01/2023  31/12/2030  GG

Now, you can merge the two dataframes using the Key, in_date and out_date as keys.

Note that there are 1-day periods where nothing was defined. For example, the period for Key 1000 and char A in dataframe1 is interrupted by the end of a period on 1.9.2020 and a beginning of another one on 2.9.2020.

After merging, you may need to clean up the dates, subtracting one day from out_time timestamps where they are the same as the next in_time timestamp and removing the now existing 0-day periods. But you can only do this after merging, as the dates would not match up, otherwise.