Search code examples
pythonpandasintersectiongranularity

Breakdown data-frame into second-by-second time-series


I have this dataset of active subjects during specified time-periods.

    start    end name
0   00:00  00:10    a
1   00:10  00:20    b
2   00:00  00:20    c
3   00:00  00:10    d
4   00:10  00:15    e
5   00:15  00:20    a

The intervals are inclusive on the left(start) side and not inclusive on the right(end).

There are always three subjects active. I want to increase the granularity of the data, so that I will have info of the three active subjects for each second. Each second has three unique values.

This would be the desired result for the test case.

   slot1 slot2 slot3
0      a     c     d
1      a     c     d
2      a     c     d
3      a     c     d
4      a     c     d
5      a     c     d
6      a     c     d
7      a     c     d
8      a     c     d
9      a     c     d
10     b     c     e
11     b     c     e
12     b     c     e
13     b     c     e
14     b     c     e
15     b     c     a
16     b     c     a
17     b     c     a
18     b     c     a
19     b     c     a

The order of the subjects inside the slots is irrelevant for now. The subjects can reappear in the data like "a" from 00:00 to 00:10 and then again from 00:15 to 00:20. The intervals can be at any second.


Solution

  • Route 1: One (costly but easy) way is to explode the data to the seconds, then merge 3 times:

    time_df = (('00:' + df[['start','end']])
                   .apply(lambda x: pd.to_timedelta(x).dt.total_seconds())
                   .astype(int)
                   .apply(lambda x: np.arange(*x), axis=1)
                   .to_frame('time')
                   .assign(slot=df['name'])
                   .explode('time')
              )
    
    (time_df.merge(time_df, on='time', suffixes=['1','2'])
         .query('slot1 < slot2')
         .merge(time_df, on='time')
         .query('slot2 < slot')
    )
    

    Output:

        time slot1 slot2 slot
    2      0     a     c    d
    11     1     a     c    d
    20     2     a     c    d
    29     3     a     c    d
    38     4     a     c    d
    47     5     a     c    d
    56     6     a     c    d
    65     7     a     c    d
    74     8     a     c    d
    83     9     a     c    d
    92    10     b     c    e
    101   11     b     c    e
    110   12     b     c    e
    119   13     b     c    e
    128   14     b     c    e
    139   15     a     b    c
    148   16     a     b    c
    157   17     a     b    c
    166   18     a     b    c
    175   19     a     b    c
    

    Route 2: Another way is to cross merge then query the overlapping intervals:

    df[['start','end']] = (('00:' + df[['start','end']])
                   .apply(lambda x: pd.to_timedelta(x).dt.total_seconds())
                   .astype(int)
              )
    
    (df.merge(df, how='cross')
       .assign(start=lambda x: x.filter(like='start').max(axis=1),
               end=lambda x: x.filter(like='end').min(axis=1))
       .query('start < end & name_x < name_y')
       [['name_x','name_y','start','end']]
       .merge(df, how='cross')
       .assign(start=lambda x: x.filter(like='start').max(axis=1),
               end=lambda x: x.filter(like='end').min(axis=1))
       .query('start < end & name_y < name')
       [['start','end', 'name_x','name_y', 'name']]
    )
    

    Output:

        start  end name_x name_y name
    3       0   10      a      c    d
    16     10   15      b      c    e
    38     15   20      a      b    c
    

    As you can see the this output is just the same as the other, but in the original form. Depending on your data, one route might better than the other.


    Update Since your data has exactly 3 slot at any time, you can easily do with pivot. This is the best solution.

    # time_df as in Route 1
    (time_df.sort_values(['time','slot'])
        .assign(nums = lambda x: np.arange(len(x)) % 3)
        .pivot('time', 'nums', 'slot')
    )
    # in general, `.assign(nums=lambda x: x.groupby('time').cumcount()`
    # also works instead of the above
    

    Output:

    nums  0  1  2
    time         
    0     a  c  d
    1     a  c  d
    2     a  c  d
    3     a  c  d
    4     a  c  d
    5     a  c  d
    6     a  c  d
    7     a  c  d
    8     a  c  d
    9     a  c  d
    10    b  c  e
    11    b  c  e
    12    b  c  e
    13    b  c  e
    14    b  c  e
    15    a  b  c
    16    a  b  c
    17    a  b  c
    18    a  b  c
    19    a  b  c