The problem: group together events that occur close to each other in time and that also share another variable. For example, given the date of disease onset and an address, find disease outbreaks that occur at the same location within a specified timeframe of each other. The data is a large (300K-row) pandas DataFrame. Example data:
df = pd.DataFrame(
    [
        ['2020-01-01 10:00', '1', 'A'],
        ['2020-01-01 10:01', '2', 'A'],
        ['2020-01-01 10:02', '3a', 'A'],
        ['2020-01-01 10:02', '3b', 'B'],
        ['2020-01-01 10:30', '4', 'B'],
        ['2020-01-01 10:50', '5', 'B'],
        ['2020-01-01 10:54', '6', 'B'],
        ['2020-01-01 10:55', '7', 'B'],
    ], columns=['event_time', 'event_id', 'Address']
)
The output should have one row per group, with the first and last event time, a list of the events, and the address:
event_time_start event_time_end events_and_related_event_id_list Address
0 2020-01-01 10:00:00 2020-01-01 10:02:00 [1, 2, 3a] A
6 2020-01-01 10:54:00 2020-01-01 10:55:00 [6, 7] B
EDIT - clarification and solution
The solution by jezrael, which matches dates within a specified number of days before or after each date, is based on a similar approach from another thread, but adds a groupby on the Address. This first step worked perfectly on the real data without modification. It is unchanged below, except that some of the values are named for clarity.
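The heart of that first step is Series.between with a symmetric window around each row's timestamp; a minimal sketch with illustrative values (ts stands in for the row's date, td for the window half-width):

import pandas as pd

ts = pd.Timestamp('2020-01-01 10:01')
td = pd.Timedelta('1D')  # the "specified number of days" before or after
s = pd.to_datetime(pd.Series(['2019-12-31 11:00', '2020-01-01 10:00', '2020-01-03 10:00']))
print(s.between(ts - td, ts + td).tolist())  # [True, True, False]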
The second step did not work because, unlike the example data, the real data contained non-continuous and non-sequential events. This required: sorting the first output by Address and event_time; different logic for the boolean series that groups event_times together (m/timeGroup_bool); and removing the boolean series as a DataFrame filter before the GroupBy.agg.
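The changed boolean logic can be seen in isolation: instead of breaking groups where the related-id list is empty, the modified version breaks wherever a row's event_time falls outside the window around the previous row's. A minimal sketch with illustrative times:

times = pd.to_datetime(pd.Series(['2020-01-01 10:00', '2020-01-01 10:01', '2020-01-01 10:07']))
td = pd.Timedelta('2m')
breaks = ~times.between(times.shift(1) - td, times.shift(1) + td)
print(breaks.tolist())  # [True, False, True] -> rows 0 and 1 group together, row 2 starts a new group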
Here is the full solution with modifications and clarifications, based on jezrael's simply awesome response (the f1 lambda, which collects all values from the grouped lists, is best explained here):
df = pd.DataFrame(
    [
        ['1', 'A', '2020-01-01 10:00'],
        ['2', 'B', '2020-01-01 10:01'],
        ['3', 'A', '2020-01-01 10:01'],
        ['4', 'C', '2020-01-01 10:02'],
        ['5', 'D', '2020-01-01 10:03'],
        ['6', 'A', '2020-01-01 10:03'],
        ['7', 'E', '2020-01-01 10:03'],
        ['8', 'A', '2020-01-01 10:07'],
        ['9', 'A', '2020-01-01 10:09'],
        ['10', 'A', '2020-01-01 10:11'],
        ['11', 'F', '2020-01-01 10:54'],
        ['12', 'G', '2020-01-01 10:55'],
        ['13', 'F', '2020-01-01 10:56'],
    ], columns=['id', 'Address', 'event_time']
)
df['event_time'] = pd.to_datetime(df['event_time'])
df = df.sort_values(by=['Address', 'event_time'])

## group by address and surrounding time
timeDiff = pd.Timedelta("2m")  # max time span between related events

def idsNearDates(mDf):
    # for each row, list the ids of all other events at this address whose
    # event_time falls within +/- timeDiff of the row's event_time
    f = lambda idx, val: mDf.loc[mDf['event_time'].between(val - timeDiff, val + timeDiff),
                                 'id'].drop(idx).tolist()
    mDf['relatedIds'] = [f(idx, val) for idx, val in mDf['event_time'].items()]
    return mDf

df_1stStep = df.groupby('Address').apply(idsNearDates).sort_values(by=['Address', 'event_time'])
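# at this point each row of df_1stStep holds, in 'relatedIds', the ids of the
# other events at the same address within +/- timeDiff; e.g. for Address 'A',
# id '3' (10:01) picks up ['1', '6'] ('1' at 10:00 and '6' at 10:03)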
## aggregate the initial output into a single row per group of related events
# mark where event times are too far apart from the previous row
timeGroup_bool = ~(df_1stStep['event_time'].between(df_1stStep['event_time'].shift(1) - timeDiff,
                                                    df_1stStep['event_time'].shift(1) + timeDiff))
# create a single de-duplicated list from all grouped lists
f1 = lambda x: list(dict.fromkeys([value for idList in x for value in idList]))
df_2ndStep = (df_1stStep.groupby([timeGroup_bool.cumsum(), 'Address'])
                        .agg(Date_first=('event_time', 'min'),
                             Date_last=('event_time', 'max'),
                             Ids=('relatedIds', f1))
                        .droplevel(0)
                        .reset_index())
# get rid of rows with empty lists
df_2ndStep = df_2ndStep[df_2ndStep['Ids'].str.len() > 0]
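As a quick illustration of what the f1 lambda does with the grouped relatedIds lists (dict.fromkeys de-duplicates while keeping first-seen order; the input lists here are illustrative):

f1 = lambda x: list(dict.fromkeys([value for idList in x for value in idList]))
print(f1([['3'], ['1', '6'], ['3']]))  # ['3', '1', '6']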
You can use this solution per group:
# solution matching dates within range of date in row by jezrael
df['event_time'] = pd.to_datetime(df['event_time'])

def f(g):
    td = pd.Timedelta("1m")
    f = lambda x, y: g.loc[g['event_time'].between(y - td, y + td),
                           'event_id'].drop(x).tolist()
    g['related_event_id_list'] = [f(k, v) for k, v in g['event_time'].items()]
    return g

df = df.groupby('Address').apply(f)
print (df)
            event_time event_id Address related_event_id_list
0  2020-01-01 10:00:00        1       A                   [2]
1  2020-01-01 10:01:00        2       A               [1, 3a]
2  2020-01-01 10:02:00       3a       A                   [2]
3  2020-01-01 10:02:00       3b       B                    []
4  2020-01-01 10:30:00        4       B                    []
5  2020-01-01 10:50:00        5       B                    []
6  2020-01-01 10:54:00        6       B                   [7]
7  2020-01-01 10:55:00        7       B                    [6]
For the next step use GroupBy.agg for groups created by consecutive non-empty list values in the related_event_id_list column:
m = df['related_event_id_list'].astype(bool)
f1 = lambda x: list(dict.fromkeys([z for y in x for z in y]))
df = (df[m].groupby([(~m).cumsum(), 'Address'])
           .agg(event_time_start=('event_time', 'min'),
                event_time_end=('event_time', 'max'),
                events_and_related_event_id_list=('related_event_id_list', f1))
           .droplevel(0)
           .reset_index())
print (df)
  Address     event_time_start       event_time_end  \
0       A  2020-01-01 10:00:00  2020-01-01 10:02:00
1       B  2020-01-01 10:54:00  2020-01-01 10:55:00

  events_and_related_event_id_list
0                       [2, 1, 3a]
1                           [7, 6]
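A note on how the grouping works: astype(bool) maps empty lists to False (an empty list is falsy) and non-empty lists to True, and (~m).cumsum() stays constant across each run of consecutive True values, so after filtering with df[m] every such run shares a single group key. A minimal sketch with an illustrative mask:

m = pd.Series([True, True, False, False, True, True])
print((~m).cumsum().tolist())  # [0, 0, 1, 2, 2, 2]
# rows kept by df[m] carry keys [0, 0, 2, 2] -> two consecutive groups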