I have the following dataframe:
ID | Fruit | Price | Location | Start_Date | End_Date |
---|---|---|---|---|---|
01 | Orange | 12 | ABC | 01-03-2015 | 01-05-2015 |
01 | Orange | 9.5 | ABC | 01-03-2015 | 01-05-2015 |
02 | Apple | 10 | PQR | 04-09-2019 | 04-11-2019 |
06 | Orange | 11 | ABC | 01-04-2015 | 01-06-2015 |
05 | Peach | 15 | XYZ | 07-11-2021 | 07-13-2021 |
08 | Apple | 10.5 | PQR | 04-09-2019 | 04-11-2019 |
10 | Apple | 10 | LMN | 04-10-2019 | 04-12-2019 |
03 | Peach | 14.5 | XYZ | 07-11-2020 | 07-13-2020 |
11 | Peach | 12.5 | ABC | 01-04-2015 | 01-05-2015 |
12 | Peach | 12.5 | ABC | 01-03-2015 | 01-05-2015 |
I want to form a group of IDs that belong to the same location, fruit, and range of start date and end date. The date interval condition is that we only group those ids together whose start_date and end_date are no more than 3 days apart. Eg. ID 06 start_date is 01-04-2015 and end_date is 01-06-2015. ID 01 start_date is 01-03-2015 and end_date is 01-05-2015. So ID 06 and 01's start_date and end_date are only 1 day apart so the merge is acceptable (i.e. these two ids can be grouped together if other variables like location and fruit match).
Also, I only want to output groups with more than 1 unique IDs.
My output should be (the start date and end date is merged):
ID | Fruit | Price | Location | Start_Date | End_Date |
---|---|---|---|---|---|
01 | Orange | 12 | ABC | 01-03-2015 | 01-06-2015 |
01 | Orange | 9.5 | |||
06 | Orange | 11 | |||
11 | Peach | 12.5 | |||
12 | Peach | 12.5 | |||
02 | Apple | 10 | PQR | 04-09-2019 | 04-11-2019 |
08 | Apple | 10.5 |
IDs 05,03 get filtered out because it's a single record (they dont meet the date interval condition). ID 10 gets filtered out because it's from a different location.
I have no idea how to merge intervals for 2 such date columns. I have tried a few techniques to test out grouping (without the date merge).
My latest one is using grouper.
output = df.groupby([pd.Grouper(key='Start_Date', freq='D'),pd.Grouper(key='End_Date', freq='D'),pd.Grouper(key='Location'),pd.Grouper(key='Fruit'),'ID']).agg(unique_emp=('ID', 'nunique'))
Need help getting the output. Thank you!!
This is essentially a gap-and-island problem. If you sort your dataframe by Fruit, Location and Start Date, you can create islands (i.e. fruit group) as follow:
The code:
for col in ["Start_Date", "End_Date"]:
df[col] = pd.to_datetime(df[col])
# This algorithm requires a sorted dataframe
df = df.sort_values(["Fruit", "Location", "Start_Date"])
# Assign each row to an island
i = 0
islands = []
last_fruit, last_location, last_start = None, None, df["Start_Date"].iloc[0]
for _, (fruit, location, start, end) in df[["Fruit", "Location", "Start_Date", "End_Date"]].iterrows():
if (fruit != last_fruit) or (location != last_location) or (end - last_start > pd.Timedelta(days=3)):
i += 1
last_fruit, last_location, last_start = fruit, location, start
else:
last_fruit, last_location = fruit, location
islands.append(i)
df["Island"] = islands
# Filter for islands having more than 1 rows
idx = pd.Series(islands).value_counts().loc[lambda c: c > 1].index
df[df["Island"].isin(idx)]