I am looking for an efficient way to process the following data in pandas.
I have a dataframe containing several hundred thousand start and end timestamps:
data_df
start_ts end_ts
0 2019-06-10 12:00:00+00:00 2019-06-10 22:30:00+00:00
1 2019-06-11 12:00:00+00:00 2019-06-11 13:30:00+00:00
2 2019-06-11 14:00:00+00:00 2019-06-11 19:00:00+00:00
3 2019-06-14 12:00:00+00:00 2019-06-14 18:30:00+00:00
4 2019-06-10 12:00:00+00:00 2019-06-10 21:30:00+00:00
5 2019-06-11 12:00:00+00:00 2019-06-11 18:30:00+00:00
...
I also have a set of labeled time bins (tp1-tp10). There are 10 bins each day, but the times of those bins can change from day to day (e.g. tp1 might be from 00:00 to 01:30 on one day, but from 00:00 to 01:45 on another day). Each data set to be processed covers 7 days, with 10 time periods per day, so the set of ranges is of size 70 and looks like:
labeled_bins_df
start_range end_range label
0 2019-06-10 00:00:00+00:00 2019-06-10 04:30:00+00:00 tp1
1 2019-06-10 04:30:00+00:00 2019-06-10 09:45:00+00:00 tp2
2 2019-06-10 09:45:00+00:00 2019-06-10 12:30:00+00:00 tp3
...
What I would like is a table with the original data_df data, but with additional columns, tp1 through tp10, containing the number of minutes of each row that fall within each bin:
timed_bins
start_ts end_ts tp1 tp2 tp3 tp4 ...
0 2019-06-10 12:00:00+00:00 2019-06-10 22:30:00+00:00 0 0 30 120 ...
1 2019-06-11 12:00:00+00:00 2019-06-11 13:30:00+00:00 0 45 45 0 ...
I am currently doing this naively: looping over my rows and searching for the bins each row falls into. As you can imagine, this is quite slow. Is there any pandas-fu that can be performed to do this kind of binning on datetime ranges?
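For concreteness, here is a simplified sketch of the naive approach (my real code differs, but the shape is the same):

# naive version: for every data row, scan every bin and accumulate
# the overlap in minutes (column names as in the frames above)
for label in labeled_bins_df['label'].unique():
    data_df[label] = 0.0

for i, row in data_df.iterrows():
    for _, b in labeled_bins_df.iterrows():
        start = max(row['start_ts'], b['start_range'])
        end = min(row['end_ts'], b['end_range'])
        if end > start:
            data_df.loc[i, b['label']] += (end - start).total_seconds() / 60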
EDIT: A thought, which might help think in a new direction. If I were to convert all of my timestamps (both in my data and in my labeled bins) into unix timestamps (seconds since Jan 1, 1970), then it would be a matter of binning/summing based on integer ranges rather than dates. This would yield the number of seconds in each bin; simply divide by 60, and I get my minutes in each bin. This takes away all concerns about date boundaries, etc.
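For example, using the first data row and the 2019-06-10 t3 bin from the sample data in EDIT 2 below, the integer arithmetic would look like this:

import pandas as pd

# row: 12:00-22:30; bin: 18:00-00:00, both on 2019-06-10, as unix seconds
s1 = pd.Timestamp('2019-06-10T12:00:00+0000').timestamp()
e1 = pd.Timestamp('2019-06-10T22:30:00+0000').timestamp()
s2 = pd.Timestamp('2019-06-10T18:00:00+0000').timestamp()
e2 = pd.Timestamp('2019-06-11T00:00:00+0000').timestamp()

# overlap of two numeric ranges, clipped at zero, converted to minutes
overlap_minutes = max(0.0, min(e1, e2) - max(s1, s2)) / 60
print(overlap_minutes)  # 270.0, matching t3 for the first row of result_df below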
EDIT 2: As requested, here is a set of simplified sample data, using three different time bins. I specifically made one of the data samples (the second row) span 2 days. Additionally, there is a result_df that shows the expected output.
import datetime

import pandas as pd

data_samples = [
    {'start_ts': '2019-06-10T12:00:00+0000', 'end_ts': '2019-06-10T22:30:00+0000'},
    {'start_ts': '2019-06-10T22:00:00+0000', 'end_ts': '2019-06-11T05:30:00+0000'},
    {'start_ts': '2019-06-10T10:00:00+0000', 'end_ts': '2019-06-10T14:15:00+0000'},
    {'start_ts': '2019-06-12T08:07:00+0000', 'end_ts': '2019-06-12T18:22:00+0000'},
    {'start_ts': '2019-06-11T14:03:00+0000', 'end_ts': '2019-06-11T15:30:00+0000'},
    {'start_ts': '2019-06-11T02:33:00+0000', 'end_ts': '2019-06-11T10:31:00+0000'}
]
data_set = [{
    'start_ts': datetime.datetime.strptime(x['start_ts'], '%Y-%m-%dT%H:%M:%S%z'),
    'end_ts': datetime.datetime.strptime(x['end_ts'], '%Y-%m-%dT%H:%M:%S%z')} for x in data_samples]
data_df = pd.DataFrame(data_set)[['start_ts', 'end_ts']]
time_bin_samples = [
    {'start_ts': '2019-06-10T00:00:00+0000', 'end_ts': '2019-06-10T08:15:00+0000', 'label': 't1'},
    {'start_ts': '2019-06-10T08:15:00+0000', 'end_ts': '2019-06-10T18:00:00+0000', 'label': 't2'},
    {'start_ts': '2019-06-10T18:00:00+0000', 'end_ts': '2019-06-11T00:00:00+0000', 'label': 't3'},
    {'start_ts': '2019-06-11T00:00:00+0000', 'end_ts': '2019-06-11T09:00:00+0000', 'label': 't1'},
    {'start_ts': '2019-06-11T09:00:00+0000', 'end_ts': '2019-06-11T19:15:00+0000', 'label': 't2'},
    {'start_ts': '2019-06-11T19:15:00+0000', 'end_ts': '2019-06-12T00:00:00+0000', 'label': 't3'},
    {'start_ts': '2019-06-12T00:00:00+0000', 'end_ts': '2019-06-12T10:30:00+0000', 'label': 't1'},
    {'start_ts': '2019-06-12T10:30:00+0000', 'end_ts': '2019-06-12T12:00:00+0000', 'label': 't2'},
    {'start_ts': '2019-06-12T12:00:00+0000', 'end_ts': '2019-06-13T00:00:00+0000', 'label': 't3'},
]
time_bin_set = [{
    'start_ts': datetime.datetime.strptime(x['start_ts'], '%Y-%m-%dT%H:%M:%S%z'),
    'end_ts': datetime.datetime.strptime(x['end_ts'], '%Y-%m-%dT%H:%M:%S%z'),
    'label': x['label']} for x in time_bin_samples
]
time_bin_df = pd.DataFrame(time_bin_set)[['start_ts', 'end_ts', 'label']]
result_set = [
    {'t1': 0, 't2': 360, 't3': 270},
    {'t1': 330, 't2': 0, 't3': 120},
    {'t1': 0, 't2': 255, 't3': 0},
    {'t1': 143, 't2': 90, 't3': 382},
    {'t1': 0, 't2': 87, 't3': 0},
    {'t1': 387, 't2': 91, 't3': 0}
]
result_df = pd.DataFrame(result_set)
I know that iterating the rows of a dataframe is not efficient. Here I would try to identify the first and last bin per row in data_df by using merge_asof. Then I would build a list of sub-dataframes, iterating once over the dataframe values to add all the bins corresponding to each row, and would concat that list. From there it is enough to compute the time interval per bin and use pivot_table to get the expected result.
Code could be:
# store the index as a column to make sure to keep it
data_df = data_df.rename_axis('ix').reset_index().sort_values(
    ['end_ts', 'start_ts'])
time_bin_df = time_bin_df.rename_axis('ix').reset_index().sort_values(
    ['end_ts', 'start_ts'])

# identify the first and last bin per row
# note: merge_asof needs the left key sorted, so this assumes that sorting
# by end_ts also leaves start_ts sorted (true when no period fully
# contains another)
first = pd.merge_asof(data_df, time_bin_df, left_on='start_ts',
                      right_on='end_ts', suffixes=('', '_first'),
                      direction='forward').values
last = pd.merge_asof(data_df, time_bin_df, left_on='end_ts',
                     right_on='start_ts', suffixes=('', '_last')).values

# build a list of bin dataframes (one per row in data_df)
data = []
for i, val in enumerate(first):
    # val[0]=row index, val[1]=start_ts, val[2]=end_ts, val[3]=ix of first bin
    elt = time_bin_df[(time_bin_df['ix'] >= val[3])
                      & (time_bin_df['ix'] <= last[i][3])].copy()
    # compute the begin and end of the intersection of the period and the bin
    elt.loc[elt['start_ts'] < val[1], 'start_ts'] = val[1]
    elt.loc[elt['end_ts'] > val[2], 'end_ts'] = val[2]
    elt['ix_data'] = val[0]
    data.append(elt)

# concat everything
tmp = pd.concat(data)
# compute durations in minutes
tmp['duration'] = (tmp['end_ts'] - tmp['start_ts']).dt.total_seconds() / 60
# pivot to get the expected result
result_df = tmp.pivot_table('duration', 'ix_data', 'label', 'sum', fill_value=0
                            ).rename_axis(None).rename_axis(None, axis=1)
It can take some time because there is still one lengthy operation (the Python loop that builds the list of per-row bin dataframes), but the other operations should be vectorized.
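If that loop is still too slow, one possible way to remove it entirely is the broadcasting idea from the question's first edit: with only 70 bins per data set, the full rows-by-bins overlap matrix stays manageable (on the order of a few hundred MB for a few hundred thousand rows). A minimal sketch, starting again from the original data_df and time_bin_df (before the renaming above); the helper to_naive_utc is mine:

import numpy as np
import pandas as pd

def to_naive_utc(s):
    # tz-aware series -> naive UTC datetime64[ns], so plain numpy broadcasting applies
    return s.dt.tz_convert('UTC').dt.tz_localize(None).to_numpy()

starts, ends = to_naive_utc(data_df['start_ts']), to_naive_utc(data_df['end_ts'])
bin_starts = to_naive_utc(time_bin_df['start_ts'])
bin_ends = to_naive_utc(time_bin_df['end_ts'])

# (n_rows, n_bins) matrix of overlaps; negative values mean "no intersection"
overlap = (np.minimum(ends[:, None], bin_ends[None, :])
           - np.maximum(starts[:, None], bin_starts[None, :]))
minutes = np.maximum(overlap, np.timedelta64(0, 'ns')) / np.timedelta64(1, 'm')

# sum the columns that share a label (each label appears once per day)
result_df = (pd.DataFrame(minutes.T, index=time_bin_df['label'].to_numpy())
             .groupby(level=0).sum().T)

On the sample data this reproduces the expected result_df; whether it beats the merge_asof version on the real data depends on row count and available memory.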