The problem: group together events that occur close to each other in time and that also share another variable. For example, given the date of disease onset and an address, find disease outbreaks that occur at the same location within a specified timeframe of each other. The data is a large (300K-row) pandas DataFrame. Example data:
df = pd.DataFrame(
    [
        ['2020-01-01 10:00', '1', 'A'],
        ['2020-01-01 10:01', '2', 'A'],
        ['2020-01-01 10:02', '3a', 'A'],
        ['2020-01-01 10:02', '3b', 'B'],
        ['2020-01-01 10:30', '4', 'B'],
        ['2020-01-01 10:50', '5', 'B'],
        ['2020-01-01 10:54', '6', 'B'],
        ['2020-01-01 10:55', '7', 'B'],
    ], columns=['event_time', 'event_id', 'Address']
)
The output should have one row per group, with the first and last event time, a list of the events, and the address:
event_time_start event_time_end events_and_related_event_id_list Address
0 2020-01-01 10:00:00 2020-01-01 10:02:00 [1, 2, 3a] A
6 2020-01-01 10:54:00 2020-01-01 10:55:00 [6, 7] B
EDIT - clarification and solution
The solution by jezrael, which matches dates within a specified number of days before or after each date, is based on a similar approach from another thread, but adds a groupby on the Address. This first step worked perfectly on the real data without modification. It is unchanged below, except that some of the values are named for clarity.
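The heart of that first step is Series.between with a symmetric window around each row's timestamp; a minimal sketch with illustrative values (ts stands in for the row's date, td for the window half-width):

import pandas as pd

ts = pd.Timestamp('2020-01-01 10:01')
td = pd.Timedelta('1D')  # the "specified number of days" before or after
s = pd.to_datetime(pd.Series(['2019-12-31 11:00', '2020-01-01 10:00', '2020-01-03 10:00']))
print(s.between(ts - td, ts + td).tolist())  # [True, True, False]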
The second step did not work because, unlike the example data, the real data contained non-continuous and non-sequential events. This required: sorting the first output by Address and event_time; different logic for the boolean series that groups event_times together (m/timeGroup_bool); and removing the boolean series as a DataFrame filter before the GroupBy.agg.
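The changed boolean logic can be seen in isolation: instead of breaking groups where the related-id list is empty, the modified version breaks wherever a row's event_time falls outside the window around the previous row's. A minimal sketch with illustrative times:

times = pd.to_datetime(pd.Series(['2020-01-01 10:00', '2020-01-01 10:01', '2020-01-01 10:07']))
td = pd.Timedelta('2m')
breaks = ~times.between(times.shift(1) - td, times.shift(1) + td)
print(breaks.tolist())  # [True, False, True] -> rows 0 and 1 group together, row 2 starts a new group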
Here is the full solution with modifications and clarifications, based on jezrael's simply awesome response (the f1 lambda, which collects all values from the grouped lists, is best explained here):
df = pd.DataFrame(
    [
        ['1', 'A', '2020-01-01 10:00'],
        ['2', 'B', '2020-01-01 10:01'],
        ['3', 'A', '2020-01-01 10:01'],
        ['4', 'C', '2020-01-01 10:02'],
        ['5', 'D', '2020-01-01 10:03'],
        ['6', 'A', '2020-01-01 10:03'],
        ['7', 'E', '2020-01-01 10:03'],
        ['8', 'A', '2020-01-01 10:07'],
        ['9', 'A', '2020-01-01 10:09'],
        ['10', 'A', '2020-01-01 10:11'],
        ['11', 'F', '2020-01-01 10:54'],
        ['12', 'G', '2020-01-01 10:55'],
        ['13', 'F', '2020-01-01 10:56'],
    ], columns=['id', 'Address', 'event_time']
)
df['event_time'] = pd.to_datetime(df['event_time'])
df = df.sort_values(by=['Address', 'event_time'])

## group by address and surrounding time
timeDiff = pd.Timedelta("2m")  # max time span between related events

def idsNearDates(mDf):
    # for each row, list the ids of all other events at this address whose
    # event_time falls within +/- timeDiff of the row's event_time
    f = lambda idx, val: mDf.loc[mDf['event_time'].between(val - timeDiff, val + timeDiff),
                                 'id'].drop(idx).tolist()
    mDf['relatedIds'] = [f(idx, val) for idx, val in mDf['event_time'].items()]
    return mDf

df_1stStep = df.groupby('Address').apply(idsNearDates).sort_values(by=['Address', 'event_time'])
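# at this point each row of df_1stStep holds, in 'relatedIds', the ids of the
# other events at the same address within +/- timeDiff; e.g. for Address 'A',
# id '3' (10:01) picks up ['1', '6'] ('1' at 10:00 and '6' at 10:03)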
## aggregate the initial output into a single row per group of related events
# mark where event times are too far apart from the previous row
timeGroup_bool = ~(df_1stStep['event_time'].between(df_1stStep['event_time'].shift(1) - timeDiff,
                                                    df_1stStep['event_time'].shift(1) + timeDiff))
# create a single de-duplicated list from all grouped lists
f1 = lambda x: list(dict.fromkeys([value for idList in x for value in idList]))
df_2ndStep = (df_1stStep.groupby([timeGroup_bool.cumsum(), 'Address'])
                        .agg(Date_first=('event_time', 'min'),
                             Date_last=('event_time', 'max'),
                             Ids=('relatedIds', f1))
                        .droplevel(0)
                        .reset_index())
# get rid of rows with empty lists
df_2ndStep = df_2ndStep[df_2ndStep['Ids'].str.len() > 0]
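As a quick illustration of what the f1 lambda does with the grouped relatedIds lists (dict.fromkeys de-duplicates while keeping first-seen order; the input lists here are illustrative):

f1 = lambda x: list(dict.fromkeys([value for idList in x for value in idList]))
print(f1([['3'], ['1', '6'], ['3']]))  # ['3', '1', '6']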
You can use this solution per group:
# solution matching dates within range of date in row by jezrael
df['event_time'] = pd.to_datetime(df['event_time'])

def f(g):
    td = pd.Timedelta("1m")
    f = lambda x, y: g.loc[g['event_time'].between(y - td, y + td),
                           'event_id'].drop(x).tolist()
    g['related_event_id_list'] = [f(k, v) for k, v in g['event_time'].items()]
    return g

df = df.groupby('Address').apply(f)
print (df)
            event_time event_id Address related_event_id_list
0  2020-01-01 10:00:00        1       A                   [2]
1  2020-01-01 10:01:00        2       A               [1, 3a]
2  2020-01-01 10:02:00       3a       A                   [2]
3  2020-01-01 10:02:00       3b       B                    []
4  2020-01-01 10:30:00        4       B                    []
5  2020-01-01 10:50:00        5       B                    []
6  2020-01-01 10:54:00        6       B                   [7]
7  2020-01-01 10:55:00        7       B                    [6]
For the next step use GroupBy.agg for groups created by consecutive non-empty list values in the related_event_id_list column:
m = df['related_event_id_list'].astype(bool)
f1 = lambda x: list(dict.fromkeys([z for y in x for z in y]))
df = (df[m].groupby([(~m).cumsum(), 'Address'])
           .agg(event_time_start=('event_time', 'min'),
                event_time_end=('event_time', 'max'),
                events_and_related_event_id_list=('related_event_id_list', f1))
           .droplevel(0)
           .reset_index())
print (df)
  Address     event_time_start       event_time_end  \
0       A  2020-01-01 10:00:00  2020-01-01 10:02:00
1       B  2020-01-01 10:54:00  2020-01-01 10:55:00

  events_and_related_event_id_list
0                       [2, 1, 3a]
1                           [7, 6]
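A note on how the grouping works: astype(bool) maps empty lists to False (an empty list is falsy) and non-empty lists to True, and (~m).cumsum() stays constant across each run of consecutive True values, so after filtering with df[m] every such run shares a single group key. A minimal sketch with an illustrative mask:

m = pd.Series([True, True, False, False, True, True])
print((~m).cumsum().tolist())  # [0, 0, 1, 2, 2, 2]
# rows kept by df[m] carry keys [0, 0, 2, 2] -> two consecutive groups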