
Aggregating overlapping events with MySQL


Let's say we have data that looks like this:

drop table if exists views; 
create table views(id int primary key,start time,end time); 
insert into views values 
(1, '15:01', '15:04'), 
(2, '15:02', '15:09'), 
(3, '15:12', '15:15'), 
(4, '16:11', '16:23'), 
(5, '16:19', '16:25'), 
(6, '17:52', '17:59'), 
(7, '18:18', '18:22'), 
(8, '16:20', '16:22'), 
(9, '18:17', '18:23'); 

It can be visualized like this:

1     |-----| 
2        |-----| 
3                 |--| 
4                       |-----| 
5                          |-----| 
6                                  |---| 
7                                        |---|  
8                           |---| 
9                                       |-----| 

Now I want to graph that data so it looks like this:

+---------------------------+
|              x            |
|    x        x xxx     xxx |
|   x xx  xx x     xx  x    |
+---------------------------+

Essentially, I want to break the timeline into segments of length X and count how many events touch each segment. Any thoughts on how to create this view?

(If you must know, this is so I can create engagement data for video analytics.)

I don't want the output to be ASCII; I want it to end up as a query result in SQL. Something like:

Time Start, Time End,  Num_Views
00:00, 00:05, 10
00:06, 00:10, 3
00:11, 00:15, 2
00:16, 00:20, 8

Solution

  • Using an auxiliary numbers table, you could do something like this:

    select
      r.Time_Start,
      r.Time_End,
      sum(v.id is not null) as Num_Views
    from (
      select
        cast(from_unixtime((m.minstart + n.n + 0) * 300) as time) as Time_Start,
        cast(from_unixtime((m.minstart + n.n + 1) * 300) as time) as Time_End
      from (
        select
          unix_timestamp(date_format(minstart, '1970-01-01 %T')) div 300 as minstart,
          unix_timestamp(date_format(maxend  , '1970-01-01 %T')) div 300 as maxend
        from (
          select
            min(start) as minstart,
            max(end  ) as maxend
          from views
        ) s
      ) m
        cross join numbers n
      where n.n between 0 and m.maxend - m.minstart
    ) r
      left join views v on v.start < r.Time_End and v.end > r.Time_Start
    group by
      r.Time_Start,
      r.Time_End
    ;
    

    For your particular example this script produces the following output:

    Time_Start  Time_End  Num_Views
    ----------  --------  ---------
    15:00:00    15:05:00  2
    15:05:00    15:10:00  1
    15:10:00    15:15:00  1
    15:15:00    15:20:00  0
    15:20:00    15:25:00  0
    15:25:00    15:30:00  0
    15:30:00    15:35:00  0
    15:35:00    15:40:00  0
    15:40:00    15:45:00  0
    15:45:00    15:50:00  0
    15:50:00    15:55:00  0
    15:55:00    16:00:00  0
    16:00:00    16:05:00  0
    16:05:00    16:10:00  0
    16:10:00    16:15:00  1
    16:15:00    16:20:00  2
    16:20:00    16:25:00  3
    16:25:00    16:30:00  0
    16:30:00    16:35:00  0
    16:35:00    16:40:00  0
    16:40:00    16:45:00  0
    16:45:00    16:50:00  0
    16:50:00    16:55:00  0
    16:55:00    17:00:00  0
    17:00:00    17:05:00  0
    17:05:00    17:10:00  0
    17:10:00    17:15:00  0
    17:15:00    17:20:00  0
    17:20:00    17:25:00  0
    17:25:00    17:30:00  0
    17:30:00    17:35:00  0
    17:35:00    17:40:00  0
    17:40:00    17:45:00  0
    17:45:00    17:50:00  0
    17:50:00    17:55:00  1
    17:55:00    18:00:00  1
    18:00:00    18:05:00  0
    18:05:00    18:10:00  0
    18:10:00    18:15:00  0
    18:15:00    18:20:00  2
    18:20:00    18:25:00  2
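
    As a spot check of the overlap condition used in the join (v.start < r.Time_End and v.end > r.Time_Start), the following minimal sketch lists the events that touch a single range. It assumes the views table from the question; the bounds 16:20 and 16:25 are simply one range picked from the output above.

    ```sql
    -- Events touching the 16:20:00-16:25:00 range.
    -- An event overlaps the range when it starts before the range ends
    -- and ends after the range starts.
    select v.id, v.start, v.end
    from views v
    where v.start < '16:25:00'
      and v.end   > '16:20:00'
    order by v.id;
    ```

    For the sample data this should return events 4, 5 and 8, which agrees with the Num_Views of 3 reported for that range.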
    

    A numbers table could be a temporary one, although I would recommend creating and initialising a permanent table, as it can be useful for many purposes. Here's one way of initialising a numbers table:

    create table numbers (n int);
    insert into numbers (n) select 0;
    insert into numbers (n) select cnt + n from numbers, (select count(*) as cnt from numbers) s;
    insert into numbers (n) select cnt + n from numbers, (select count(*) as cnt from numbers) s;
    insert into numbers (n) select cnt + n from numbers, (select count(*) as cnt from numbers) s;
    insert into numbers (n) select cnt + n from numbers, (select count(*) as cnt from numbers) s;
    insert into numbers (n) select cnt + n from numbers, (select count(*) as cnt from numbers) s;
    insert into numbers (n) select cnt + n from numbers, (select count(*) as cnt from numbers) s;
    insert into numbers (n) select cnt + n from numbers, (select count(*) as cnt from numbers) s;
    insert into numbers (n) select cnt + n from numbers, (select count(*) as cnt from numbers) s;
    /* repeat as necessary; every repeated line doubles the number of rows */
    

    A ‘live’ version of this script can be found on SQL Fiddle.
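
    On MySQL 8.0 or later, the numbers could probably also be generated on the fly with a recursive common table expression instead of a stored numbers table. This is not part of the original script, only a minimal sketch: seq is a hypothetical name and the upper bound of 255 is an arbitrary choice that matches the 256 rows produced by the eight doubling inserts above.

    ```sql
    -- Assumption: MySQL 8.0+ (recursive CTEs are not available in 5.x).
    -- Generates the integers 0..255, one per row.
    with recursive seq (n) as (
      select 0                                  -- anchor row
      union all
      select n + 1 from seq where n < 255       -- stop once 255 is reached
    )
    select n from seq;
    ```

    The same CTE could be placed in front of the main query and joined in place of the numbers table. For much longer sequences, keep in mind that the server limits recursion depth through the cte_max_recursion_depth system variable (1000 by default), so a permanent numbers table may still be the simpler option.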

    UPDATE (an attempt at a description of the method used)

    The above solution implements the following steps:

    1. Find the earliest start time and the latest end time in the views table.

    2. Convert both values to unix timestamps.

    3. Divide both timestamps by 300 using integer division (300 seconds being 5 minutes), which essentially gives us the indexes of the corresponding 5-minute ranges (since the Epoch).

    4. With the help of a numbers table, generate a series of 5-minute ranges covering the overall range between start and end.

    5. Match the range list against the event times in the views table, using an outer join so that ranges with no events are still included (if we want them in the output).

    6. Group the results by the range bounds and count the number of events in the groups. (And I've just noticed that the sum(v.id is not null) in the above query could be replaced with the more concise and, in this case, more natural count(v.id).)
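
    The six steps can also be sketched with count(v.id) in place of sum(v.id is not null). In this variant I have swapped the unix_timestamp()/from_unixtime() round trip for time_to_sec()/sec_to_time(), which work directly on TIME values and do not depend on the session time zone. That swap is my substitution, not the method used above. The sketch assumes the views and numbers tables defined earlier; the aliases idx, minidx and maxidx are illustrative names.

    ```sql
    select
      sec_to_time(r.idx * 300)       as Time_Start,
      sec_to_time((r.idx + 1) * 300) as Time_End,
      count(v.id)                    as Num_Views   -- step 6: unmatched ranges count as 0
    from (
      -- Step 4: one row per 5-minute range index between the first and the last
      select m.minidx + n.n as idx
      from (
        -- Steps 1-3: earliest start and latest end, as 5-minute range indexes
        select
          min(time_to_sec(start)) div 300 as minidx,
          max(time_to_sec(end))   div 300 as maxidx
        from views
      ) m
        cross join numbers n
      where n.n between 0 and m.maxidx - m.minidx
    ) r
      -- Step 5: outer join keeps ranges that no event touches
      left join views v
        on  time_to_sec(v.start) < (r.idx + 1) * 300
        and time_to_sec(v.end)   > r.idx * 300
    group by r.idx
    order by r.idx;
    ```

    For the sample data this should produce the same 41 ranges and counts as the output shown earlier. Changing every 300 to another number of seconds changes the segment length.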