I have two dataframes, one that designates one characteristic and another that designates another characteristic, I wanted to join them, but in such a way that the result would be dependent of the intersection between dates.
df1:
Key | in_date | out_date | char1 | |
---|---|---|---|---|
0 | 1000 | 01/01/2020 | 01/10/2020 | A |
1 | 1000 | 02/10/2020 | 10/12/2020 | B |
2 | 2000 | 01/01/2019 | 10/01/2019 | C |
3 | 2000 | 11/01/2019 | 01/10/2022 | D |
4 | 2000 | 02/10/2022 | 01/01/2023 | B |
5 | 2000 | 02/01/2023 | 31/12/2030 | L |
df2
Key | st_date | end_date | char2 | |
---|---|---|---|---|
0 | 1000 | 01/05/2019 | 01/09/2020 | G |
1 | 1000 | 02/09/2020 | 10/11/2020 | GG |
2 | 2000 | 20/01/2019 | 15/02/2019 | K |
3 | 2000 | 16/02/2019 | 10/01/2022 | GE |
4 | 2000 | 11/01/2022 | 31/12/2030 | GG |
Desire result:
key | start | end | char1 | char2 | |
---|---|---|---|---|---|
0 | 1000 | 01/05/2019 | 31/12/2019 | null | G |
1 | 1000 | 01/01/2020 | 31/08/2020 | A | G |
2 | 1000 | 01/09/2020 | 01/10/2020 | A | GG |
3 | 1000 | 02/10/2020 | 10/11/2020 | A | GG |
4 | 1000 | 11/11/2020 | 10/12/2020 | B | nan |
5 | 2000 | 01/01/2019 | 10/01/2019 | C | nan |
6 | 2000 | 11/01/2019 | 19/01/2019 | D | nan |
7 | 2000 | 20/01/2019 | 15/02/2019 | D | K |
8 | 2000 | 16/02/2019 | 10/01/2022 | D | GE |
9 | 2000 | 11/01/2022 | 01/10/2022 | D | GG |
10 | 2000 | 02/10/2022 | 01/01/2023 | B | GG |
11 | 2000 | 02/01/2023 | 31/12/2030 | L | GG |
I tried to use many if and else, but when I tried to aggregate the dataframe, didn't work.
I tried to use pd.merge but I have a sparse matrix
I have a similar poblem in my line of work. The easiest way to tackle this problem is to synchronize both dataframes to ensure that all periods are the same. That means that if there is a break between two entries in one dataframe, you need to introduce it to the other as well.
This will lead to two dataframes that have the same values for in_time
and out_time
. Here is an example for a synchronizing function. The dataframes are transformed to dictionaries of the form:
df1 = {'Key': ['1000', '1000', '2000', '2000', '2000', '2000'], 'in_date': ['01/01/2020', '02/10/2020', '01/01/2019', '11/01/2019', '02/10/2022', '02/01/2023'], 'out_date': ['01/10/2020', '10/12/2020', '10/01/2019', '01/10/2022', '01/01/2023', '31/12/2030'], 'char1': ['A', 'B', 'C', 'D', 'B', 'L']}
df2 = {'Key': ['1000', '1000', '2000', '2000', '2000'], 'in_date': ['01/05/2019', '02/09/2020', '20/01/2019', '16/02/2019', '11/01/2022'], 'out_date': ['01/09/2020', '10/11/2020', '15/02/2019', '10/01/2022', '31/12/2030'], 'char2': ['G', 'GG', 'K', 'GE', 'GG']}
The function is:
def synchronize(df1,df2):
cuts1 = {}
cuts2 = {}
## Define the cutpoints for each Key
for i,key in enumerate(df1["Key"]):
cuts = [datetime.strptime(df1["in_date"][i],"%d/%m/%Y"),datetime.strptime(df1["out_date"][i],"%d/%m/%Y")]
try:
cuts1[key]+=cuts
except:
cuts1[key]=cuts
cuts1[key]=sorted(cuts1[key])
for i,key in enumerate(df2["Key"]):
cuts = [datetime.strptime(df2["in_date"][i],"%d/%m/%Y"),datetime.strptime(df2["out_date"][i],"%d/%m/%Y")]
try:
cuts2[key]+=cuts
except:
cuts2[key]=cuts
cuts2[key]=sorted(cuts2[key])
## Now we have two lists of sorted cutpoints to split periods in the other data frame
## Extend both data frames to adhere to these cuts:
outdf1 = {}
outdf2 = {}
for v in df1.keys(): outdf1[v] = []
for v in df2.keys(): outdf2[v] = []
for i,key in enumerate(df1["Key"]):
st,et = datetime.strptime(df1["in_date"][i],"%d/%m/%Y"),datetime.strptime(df1["out_date"][i],"%d/%m/%Y")
rcuts = []
for c in cuts2[key]:
if c>st and c<et: ## Cut within this session
rcuts.append(c)
if len(rcuts)==0:
periods = [(st,et)]
else:
periods = [(st,rcuts[0])]
for ci in range(len(rcuts)-1):
periods.append((rcuts[ci],rcuts[ci+1]))
periods.append((rcuts[-1],et))
for p in periods:
outdf1["Key"].append(key)
outdf1["in_date"].append(p[0].strftime("%d/%m/%Y"))
outdf1["out_date"].append(p[1].strftime("%d/%m/%Y"))
outdf1["char1"].append(df1["char1"][i])
for i,key in enumerate(df2["Key"]):
st,et = datetime.strptime(df2["in_date"][i],"%d/%m/%Y"),datetime.strptime(df2["out_date"][i],"%d/%m/%Y")
rcuts = []
for c in cuts1[key]:
if c>st and c<et: ## Cut within this session
rcuts.append(c)
if len(rcuts)==0:
periods = [(st,et)]
else:
periods = [(st,rcuts[0])]
ci=0
for ci in range(len(rcuts)-1):
periods.append((rcuts[ci],rcuts[ci+1]))
periods.append((rcuts[-1],et))
for p in periods:
outdf2["Key"].append(key)
outdf2["in_date"].append(p[0].strftime("%d/%m/%Y"))
outdf2["out_date"].append(p[1].strftime("%d/%m/%Y"))
outdf2["char2"].append(df2["char2"][i])
return outdf1,outdf2
Calling this function as: new_df1,new_df2 = synchronize(df1,df2)
will result in two new dataframes with the following contents:
#new_df1
Key in_date out_date char1
1000 01/01/2020 01/09/2020 A
1000 01/09/2020 02/09/2020 A
1000 02/09/2020 01/10/2020 A
1000 02/10/2020 10/11/2020 B
1000 10/11/2020 10/12/2020 B
2000 01/01/2019 10/01/2019 C
2000 11/01/2019 20/01/2019 D
2000 20/01/2019 15/02/2019 D
2000 15/02/2019 16/02/2019 D
2000 16/02/2019 10/01/2022 D
2000 10/01/2022 11/01/2022 D
2000 11/01/2022 01/10/2022 D
2000 02/10/2022 01/01/2023 B
2000 02/01/2023 31/12/2030 L
## new_df2
Key in_date out_date char2
1000 01/05/2019 01/01/2020 G
1000 01/01/2020 01/09/2020 G
1000 02/09/2020 01/10/2020 GG
1000 01/10/2020 02/10/2020 GG
1000 02/10/2020 10/11/2020 GG
2000 20/01/2019 15/02/2019 K
2000 16/02/2019 10/01/2022 GE
2000 11/01/2022 01/10/2022 GG
2000 01/10/2022 02/10/2022 GG
2000 02/10/2022 01/01/2023 GG
2000 01/01/2023 02/01/2023 GG
2000 02/01/2023 31/12/2030 GG
Now, you can merge the two dataframes using the Key
, in_date
and out_date
as keys.
Note that there are 1-day periods where nothing was defined. For example, the period for Key 1000 and char A in dataframe1 is interrupted by the end of a period on 1.9.2020 and a beginning of another one on 2.9.2020.
After merging, you may need to clean up the dates, subtracting one day from out_time
timestamps where they are the same as the next in_time
timestamp and removing the now existing 0-day periods. But you can only do this after merging, as the dates would not match up, otherwise.