
Pandas- binning number of minutes in various datetime ranges


I am looking for an efficient way to process the following data in pandas.

I have a dataframe containing several hundred thousand start and end timestamps:

data_df
                      start_ts                     end_ts
0    2019-06-10 12:00:00+00:00  2019-06-10 22:30:00+00:00
1    2019-06-11 12:00:00+00:00  2019-06-11 13:30:00+00:00
2    2019-06-11 14:00:00+00:00  2019-06-11 19:00:00+00:00
3    2019-06-14 12:00:00+00:00  2019-06-14 18:30:00+00:00
4    2019-06-10 12:00:00+00:00  2019-06-10 21:30:00+00:00
5    2019-06-11 12:00:00+00:00  2019-06-11 18:30:00+00:00
...

I also have a set of labeled time bins (tp1-tp10). There are 10 bins each day, but the times of those bins can change from day to day (e.g. tp1 might be from 00:00 to 01:30 on one day, but 00:00 to 01:45 on another day). Each data set to be processed covers 7 days, with 10 time periods per day, so the set of ranges is of size 70, and looks like:

labeled_bins_df
                   start_range                  end_range  label
0    2019-06-10 00:00:00+00:00  2019-06-10 04:30:00+00:00    tp1
1    2019-06-10 04:30:00+00:00  2019-06-10 09:45:00+00:00    tp2
2    2019-06-10 09:45:00+00:00  2019-06-10 12:30:00+00:00    tp3
...

What I would like is a table with the original data_df data, but with additional columns, tp1 through tp10, containing the number of minutes of each row that fall into each bin:

timed_bins
                      start_ts                     end_ts    tp1    tp2    tp3    tp4 ...
0    2019-06-10 12:00:00+00:00  2019-06-10 22:30:00+00:00      0      0     30    120 ...
1    2019-06-11 12:00:00+00:00  2019-06-11 13:30:00+00:00      0     45     45      0 ...

I am currently doing this naively, looping over my rows, and searching for the bins each data row is in, and as you can imagine, this is quite slow. Is there any pandas-fu that can be performed to do this kind of binning on datetime ranges?

EDIT: A thought, which might help think in a new direction. If I were to convert all of my timestamps (both in my data and in my labeled bins) into unix timestamps (seconds since Jan 1, 1970), then it would be a matter of binning/summing based on integer ranges rather than dates. This would yield the number of seconds in each bin; simply divide by 60 and I get my minutes in each bin. This takes away all concerns about date boundaries, etc.
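For what it's worth, the conversion itself is cheap and vectorized. A minimal sketch (the sample timestamps here are made up for illustration):

```python
import pandas as pd

ts = pd.Series(pd.to_datetime(['2019-06-10T12:00:00+0000',
                               '2019-06-10T22:30:00+0000']))

# seconds since the Unix epoch; this form is portable across pandas
# versions (.astype('int64') also works in recent pandas, but counts
# nanoseconds rather than seconds)
epoch_secs = (ts - pd.Timestamp('1970-01-01', tz='UTC')) // pd.Timedelta('1s')
print(epoch_secs.tolist())
```

The two samples are 10.5 hours apart, so the integers differ by 37800 seconds.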

EDIT 2: As requested, here is a set of simplified sample data, using three different time bins. I specifically made one of the data samples (the second row) span 2 days. Additionally, there is a result_df that shows the expected output.

import datetime

import pandas as pd

data_samples = [
    {'start_ts': '2019-06-10T12:00:00+0000', 'end_ts': '2019-06-10T22:30:00+0000'},
    {'start_ts': '2019-06-10T22:00:00+0000', 'end_ts': '2019-06-11T05:30:00+0000'},
    {'start_ts': '2019-06-10T10:00:00+0000', 'end_ts': '2019-06-10T14:15:00+0000'},
    {'start_ts': '2019-06-12T08:07:00+0000', 'end_ts': '2019-06-12T18:22:00+0000'},
    {'start_ts': '2019-06-11T14:03:00+0000', 'end_ts': '2019-06-11T15:30:00+0000'},
    {'start_ts': '2019-06-11T02:33:00+0000', 'end_ts': '2019-06-11T10:31:00+0000'}
]

data_set = [{
    'start_ts': datetime.datetime.strptime(x['start_ts'], '%Y-%m-%dT%H:%M:%S%z'),
    'end_ts': datetime.datetime.strptime(x['end_ts'], '%Y-%m-%dT%H:%M:%S%z')} for x in data_samples]

data_df = pd.DataFrame(data_set)[['start_ts', 'end_ts']]

time_bin_samples = [
    {'start_ts': '2019-06-10T00:00:00+0000', 'end_ts': '2019-06-10T08:15:00+0000', 'label': 't1'},
    {'start_ts': '2019-06-10T08:15:00+0000', 'end_ts': '2019-06-10T18:00:00+0000', 'label': 't2'},
    {'start_ts': '2019-06-10T18:00:00+0000', 'end_ts': '2019-06-11T00:00:00+0000', 'label': 't3'},

    {'start_ts': '2019-06-11T00:00:00+0000', 'end_ts': '2019-06-11T09:00:00+0000', 'label': 't1'},
    {'start_ts': '2019-06-11T09:00:00+0000', 'end_ts': '2019-06-11T19:15:00+0000', 'label': 't2'},
    {'start_ts': '2019-06-11T19:15:00+0000', 'end_ts': '2019-06-12T00:00:00+0000', 'label': 't3'},

    {'start_ts': '2019-06-12T00:00:00+0000', 'end_ts': '2019-06-12T10:30:00+0000', 'label': 't1'},
    {'start_ts': '2019-06-12T10:30:00+0000', 'end_ts': '2019-06-12T12:00:00+0000', 'label': 't2'},
    {'start_ts': '2019-06-12T12:00:00+0000', 'end_ts': '2019-06-13T00:00:00+0000', 'label': 't3'},
]

time_bin_set = [{
    'start_ts': datetime.datetime.strptime(x['start_ts'], '%Y-%m-%dT%H:%M:%S%z'),
    'end_ts': datetime.datetime.strptime(x['end_ts'], '%Y-%m-%dT%H:%M:%S%z'),
    'label': x['label']} for x in time_bin_samples
]

time_bin_df = pd.DataFrame(time_bin_set)[['start_ts', 'end_ts', 'label']]

result_set = [
    {'t1': 0, 't2': 360, 't3': 270},
    {'t1': 330, 't2': 0, 't3': 120},
    {'t1': 0, 't2': 255, 't3': 0},
    {'t1': 143, 't2': 90, 't3': 382},
    {'t1': 0, 't2': 87, 't3': 0},
    {'t1': 387, 't2': 91, 't3': 0}
]

result_df = pd.DataFrame(result_set)

Solution

  • I know that iterating over the rows of a dataframe is not efficient.

    Here I would first identify the first and last bin for each row in data_df by using merge_asof.

    Then I would build a list of sub-dataframes, iterating once over the dataframe values to collect all the bins corresponding to each row, and concat that list.

    From there it is enough to compute the time interval per bin and use pivot_table to get the expected result.

    Code could be:

    # store the index as a column to make sure to keep it
    data_df = data_df.rename_axis('ix').reset_index().sort_values(
        ['end_ts', 'start_ts'])
    time_bin_df = time_bin_df.rename_axis('ix').reset_index().sort_values(
        ['end_ts', 'start_ts'])
    
    # identify first and last bin per row
    first = pd.merge_asof(data_df, time_bin_df, left_on='start_ts',
                          right_on='end_ts', suffixes=('', '_first'),
                          direction='forward').values
    last = pd.merge_asof(data_df, time_bin_df, left_on='end_ts', right_on='start_ts',
                         suffixes=('', '_last')).values
    
    # build a list of bin dataframes (one per row in data_df)
    data = []
    for i, val in enumerate(first):
        elt = time_bin_df[(time_bin_df['ix']>=val[3])
                          &(time_bin_df['ix']<=last[i][3])].copy()
        # compute the begin and end of the intersection of the period and the bin
        elt.loc[elt['start_ts']<val[1], 'start_ts'] = val[1]
        elt.loc[elt['end_ts']>val[2], 'end_ts'] = val[2]
        elt['ix_data'] = val[0]
        data.append(elt)
    
    # concat everything
    tmp = pd.concat(data)
    
    # compute durations in minutes
    tmp['duration'] = (tmp['end_ts'] - tmp['start_ts']).dt.total_seconds() / 60
    
    # pivot to get the expected result
    result_df = tmp.pivot_table(values='duration', index='ix_data', columns='label',
                                aggfunc='sum', fill_value=0
                                ).rename_axis(None).rename_axis(None, axis=1)
    

    It can take some time because the list of per-row dataframes is still built in a Python loop, but all the other operations are vectorized.
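    As a fully vectorized alternative (a sketch, not part of the answer above): cross-join every row with every bin, compute the interval intersection, and clip negative overlaps to zero. This assumes pandas >= 1.2 for `merge(how='cross')`; with several hundred thousand rows and 70 bins the intermediate frame has tens of millions of rows, which is usually still manageable. Using a two-row subset of the sample data from the question:

    ```python
    import pandas as pd

    data_df = pd.DataFrame({
        'start_ts': pd.to_datetime(['2019-06-10T12:00:00+0000',
                                    '2019-06-10T22:00:00+0000']),
        'end_ts':   pd.to_datetime(['2019-06-10T22:30:00+0000',
                                    '2019-06-11T05:30:00+0000']),
    })
    time_bin_df = pd.DataFrame({
        'start_ts': pd.to_datetime(['2019-06-10T00:00:00+0000',
                                    '2019-06-10T08:15:00+0000',
                                    '2019-06-10T18:00:00+0000',
                                    '2019-06-11T00:00:00+0000']),
        'end_ts':   pd.to_datetime(['2019-06-10T08:15:00+0000',
                                    '2019-06-10T18:00:00+0000',
                                    '2019-06-11T00:00:00+0000',
                                    '2019-06-11T09:00:00+0000']),
        'label': ['t1', 't2', 't3', 't1'],
    })

    # pair every data row with every bin
    m = data_df.reset_index().merge(time_bin_df, how='cross',
                                    suffixes=('', '_bin'))

    # intersection of [start_ts, end_ts] and [start_ts_bin, end_ts_bin];
    # non-overlapping pairs come out negative and are clipped to 0
    overlap = (m[['end_ts', 'end_ts_bin']].min(axis=1)
               - m[['start_ts', 'start_ts_bin']].max(axis=1))
    m['minutes'] = overlap.dt.total_seconds().clip(lower=0) / 60

    result = m.pivot_table(values='minutes', index='index', columns='label',
                           aggfunc='sum', fill_value=0)
    print(result)
    ```

    On the full sample data this reproduces result_df (row 0: t2=360, t3=270; row 1: t1=330, t3=120). The trade-off is memory: the cross join materializes rows x bins pairs, whereas the merge_asof approach only materializes the bins each row actually touches.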