Group ids by 2 date interval columns and 2 other columns

I have the following dataframe:

ID	Fruit	Price	Location	Start_Date	End_Date
01	Orange	12	ABC	01-03-2015	01-05-2015
01	Orange	9.5	ABC	01-03-2015	01-05-2015
02	Apple	10	PQR	04-09-2019	04-11-2019
06	Orange	11	ABC	01-04-2015	01-06-2015
05	Peach	15	XYZ	07-11-2021	07-13-2021
08	Apple	10.5	PQR	04-09-2019	04-11-2019
10	Apple	10	LMN	04-10-2019	04-12-2019
03	Peach	14.5	XYZ	07-11-2020	07-13-2020
11	Peach	12.5	ABC	01-04-2015	01-05-2015
12	Peach	12.5	ABC	01-03-2015	01-05-2015

I want to form a group of IDs that belong to the same location, fruit, and range of start date and end date. The date interval condition is that we only group those ids together whose start_date and end_date are no more than 3 days apart. Eg. ID 06 start_date is 01-04-2015 and end_date is 01-06-2015. ID 01 start_date is 01-03-2015 and end_date is 01-05-2015. So ID 06 and 01's start_date and end_date are only 1 day apart so the merge is acceptable (i.e. these two ids can be grouped together if other variables like location and fruit match).

Also, I only want to output groups with more than 1 unique IDs.

My output should be (the start date and end date is merged):

ID	Fruit	Price	Location	Start_Date	End_Date
01	Orange	12	ABC	01-03-2015	01-06-2015
01	Orange	9.5
06	Orange	11
11	Peach	12.5
12	Peach	12.5
02	Apple	10	PQR	04-09-2019	04-11-2019
08	Apple	10.5

IDs 05,03 get filtered out because it's a single record (they dont meet the date interval condition). ID 10 gets filtered out because it's from a different location.

I have no idea how to merge intervals for 2 such date columns. I have tried a few techniques to test out grouping (without the date merge).

My latest one is using grouper.

output = df.groupby([pd.Grouper(key='Start_Date', freq='D'),pd.Grouper(key='End_Date', freq='D'),pd.Grouper(key='Location'),pd.Grouper(key='Fruit'),'ID']).agg(unique_emp=('ID', 'nunique'))

Need help getting the output. Thank you!!

Solution

This is essentially a gap-and-island problem. If you sort your dataframe by Fruit, Location and Start Date, you can create islands (i.e. fruit group) as follow:

If the current row's Fruit or Location is not the same as the previous row's, start a new island
If the current row's End Date is more than 3 days after the island's Start Date, make a new island

The code:

for col in ["Start_Date", "End_Date"]:
    df[col] = pd.to_datetime(df[col])

# This algorithm requires a sorted dataframe
df = df.sort_values(["Fruit", "Location", "Start_Date"])

# Assign each row to an island
i = 0
islands = []
last_fruit, last_location, last_start = None, None, df["Start_Date"].iloc[0]

for _, (fruit, location, start, end) in df[["Fruit", "Location", "Start_Date", "End_Date"]].iterrows():
    if (fruit != last_fruit) or (location != last_location) or (end - last_start > pd.Timedelta(days=3)):
        i += 1
        last_fruit, last_location, last_start = fruit, location, start
    else:
        last_fruit, last_location = fruit, location
    islands.append(i)

df["Island"] = islands

# Filter for islands having more than 1 rows
idx = pd.Series(islands).value_counts().loc[lambda c: c > 1].index
df[df["Island"].isin(idx)]