mysql, datetime, count, inner-join

How to compute frequency of concurrent events by combination in MySQL?


I am looking for a way to identify event names that co-occur, i.e., to correlate event names that share the same start (startts) and end (endts) times. The events are exactly concurrent (partial overlap does not occur in this database).

Toy data:

+------+---------+-------+
| name | startts | endts |
+------+---------+-------+
| A    | 02:20   | 02:23 |
| A    | 02:23   | 02:25 |
| A    | 02:27   | 02:28 |
| B    | 02:20   | 02:23 |
| B    | 02:23   | 02:25 |
| B    | 02:25   | 02:27 |
| C    | 02:27   | 02:28 |
| D    | 02:27   | 02:28 |
| D    | 02:28   | 02:31 |
| E    | 02:27   | 02:28 |
| E    | 02:29   | 02:31 |
+------+---------+-------+
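
To reproduce the toy data, something like the following should work. This is only a sketch: the table name `mytable` and the `time` column types are assumptions, and the real table may well use `datetime` or timestamp columns instead.

```sql
-- Hypothetical schema: the table name and column types are assumed,
-- not taken from the real database.
create table mytable (
    name    varchar(10) not null,
    startts time        not null,
    endts   time        not null
);

-- The eleven rows of the toy data above.
insert into mytable (name, startts, endts) values
    ('A', '02:20:00', '02:23:00'),
    ('A', '02:23:00', '02:25:00'),
    ('A', '02:27:00', '02:28:00'),
    ('B', '02:20:00', '02:23:00'),
    ('B', '02:23:00', '02:25:00'),
    ('B', '02:25:00', '02:27:00'),
    ('C', '02:27:00', '02:28:00'),
    ('D', '02:27:00', '02:28:00'),
    ('D', '02:28:00', '02:31:00'),
    ('E', '02:27:00', '02:28:00'),
    ('E', '02:29:00', '02:31:00');
```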

Ideal output:


+-------------+-------+
| combination | count |
+-------------+-------+
| AB          | 2     |
| AC          | 1     |
| AE          | 1     |
| AD          | 1     |
| BC          | 0     |
| BD          | 0     |
| BE          | 0     |
| CE          | 0     |
+-------------+-------+

Naturally, I would have tried a loop, but I recognize that MySQL Server is not well suited to this.

What I've tried is generating a temporary table by selecting the distinct combinations of name, startts, and endts, and then doing a left join of that table on itself (selecting name).

Thank you.


Solution

  • I understand this as a self-join, aggregation, and a conditional count of matching intervals:

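    select t1.name name1, t2.name name2,
        -- in MySQL a comparison evaluates to 1 or 0, so sum() counts the matching interval pairs
        sum(t1.startts = t2.startts and t1.endts = t2.endts) cnt
    from mytable t1
    -- t2.name > t1.name keeps each unordered pair of names once and drops self-pairs
    inner join mytable t2 on t2.name > t1.name
    group by t1.name, t2.name
    order by t1.name, t2.name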
    

    Demo on DB Fiddle:

    name1 | name2 | cnt
    :---- | :---- | --:
    A     | B     |   2
    A     | C     |   1
    A     | D     |   1
    A     | E     |   1
    B     | C     |   0
    B     | D     |   0
    B     | E     |   0
    C     | D     |   1
    C     | E     |   1
    D     | E     |   1
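
    If you would rather get a single combination column, as in the question's ideal output, a variation along these lines should do it. This is only a sketch, assuming the same `mytable` and that simply gluing the two names together with concat() is an acceptable label:

    ```sql
    -- Same self-join as above; only the select list changes.
    select concat(t1.name, t2.name) combination,
        sum(t1.startts = t2.startts and t1.endts = t2.endts) cnt
    from mytable t1
    inner join mytable t2 on t2.name > t1.name
    group by t1.name, t2.name
    order by t1.name, t2.name
    ```

    Grouping still happens on the two separate name columns, so names longer than one character cannot collide (for instance 'AB' + 'C' versus 'A' + 'BC' stay distinct groups even though their labels look the same).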
    

    Note that, if you are looking for a count of overlapping intervals, all you have to do is change the sum() to:

    sum(t1.startts <= t2.endts and t1.endts >= t2.startts) cnt
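
    Spelled out in full, the overlap variant might look like the sketch below. One thing to be aware of: with <= and >=, two intervals that merely touch (one ends at 02:23 and the other starts at 02:23) are counted as overlapping. I am assuming here that you would not want that, so the sketch uses strict comparisons; switch back to <= and >= if touching intervals should count.

    ```sql
    select t1.name name1, t2.name name2,
        -- strict < and > so that back-to-back intervals are not counted as overlapping
        sum(t1.startts < t2.endts and t1.endts > t2.startts) cnt
    from mytable t1
    inner join mytable t2 on t2.name > t1.name
    group by t1.name, t2.name
    order by t1.name, t2.name
    ```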