Search code examples
postgresqlconditional-statementsinner-joinaggregation

How to compute frequency/count of concurrent events by combination in postgresql?


I am looking for a way to identify event names names that co-occur: i.e., correlate event names with the same start (startts) and end (endts) times: the events are exactly concurrent (partial overlap is not a feature of this data base, which makes this conditional criterion a bit simpler to satisfy).

toy dataframe

+------------------+
|name startts endts|
| A   02:20  02:23 |
| A   02:23  02:25 |
| A   02:27  02:28 |
| B   02:20  02:23 |
| B   02:23  02:25 |
| B   02:25  02:27 |
| C   02:27  02:28 |
| D   02:27  02:28 |
| D   02:28  02:31 |
| E   02:27  02:28 |
| E   02:29  02:31 |
+------------------+

Ideal output:


+---------------------------+
|combination| count         |
+---------------------------+
|  AB       | 2             |
|  AC       | 1             |
|  AE       | 1             |
|  AD       | 1             |
|  BC       | 0             |
|  BD       | 0             |
|  BE       | 0             |
|  CE       | 0             |
+-----------+---------------+

Naturally, I would have tried a loop but I recognize PostgreSQL is not optimal for this.

What I've tried is generating a temporary table by selecting for distinct name and startts and endts combinations and then doing a left join on the table itself (selecting name).

User @GMB provided the following (modified) solution; however, the performance is not satisfactory given the size of the database (even running the query on a time window of 10 minutes never completes). For context, there are about 300-400 unique names; so about 80200 combinations (if my math checks out). Order is not important for the permutations.

@GMB's attempt: I understand this as a self-join, aggregation, and a conditional count of matching intervals:

    select t1.name name1, t2.name name2,
        sum(case when t1.startts = t2.startts and t1.endts = t2.endts then 1 else 0 end) cnt
    from mytable t1
    inner join mytable t2 on t2.name > t1.name
    group by t1.name, t2.name
    order by t1.name, t2.name

Demo on DB Fiddle:

name1 | name2 | cnt
:---- | :---- | --:
A     | B     |   2
A     | C     |   1
A     | D     |   1
A     | E     |   1
B     | C     |   0
B     | D     |   0
B     | E     |   0
C     | D     |   1
C     | E     |   1
D     | E     |   1

@GMB notes that, if you are looking for a count of overlapping intervals, all you have to do is change the sum() to:

    sum(t1.startts <= t2.endts and t1.endts >= t2.startts) cnt

Version = PostgreSQL 8.0.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.4.2 20041017 (Red Hat 3.4.2-6.fc3), Redshift 1.0.19097

Thank you.


Solution

  • Consider the following in MySQL (where your DBFiddle points to):

    SELECT name, COUNT(*)
    FROM (
        SELECT group_concat(name ORDER BY name) name
        FROM mytable
        GROUP BY startts, endts
        ORDER BY name
    ) as names
    GROUP BY name
    ORDER BY name
    

    Equivalent in PostgreSQL:

    SELECT name, COUNT(*)
    FROM (
        SELECT string_agg(name ORDER BY name) name
        FROM mytable
        GROUP BY startts, endts
        ORDER BY name
    ) as names
    GROUP BY name
    ORDER BY name
    

    First, you create a list of concurrent events (in the subquery), and then you count them.