Search code examples
sqlarrayspostgresqlmany-to-manyaggregate

Query to count the frequence of many-to-many associations


I have two tables with a many-to-many association in postgresql. The first table contains activities, which may count zero or more reasons:

CREATE TABLE activity (
   id integer NOT NULL,
   -- other fields removed for readability
);

CREATE TABLE reason (
   id varchar(1) NOT NULL,
   -- other fields here
);

For performing the association, a join table exists between those two tables:

CREATE TABLE activity_reason (
   activity_id integer NOT NULL, -- refers to activity.id
   reason_id varchar(1) NOT NULL, -- refers to reason.id
   CONSTRAINT activity_reason_activity FOREIGN KEY (activity_id) REFERENCES activity (id),
  CONSTRAINT activity_reason_reason FOREIGN KEY (reason_id) REFERENCES reason (id)
);

I would like to count the possible association between activities and reasons. Supposing I have those records in the table activity_reason:

+--------------+------------+
| activity_id  |  reason_id |
+--------------+------------+
|           1  |          A |
|           1  |          B |
|           2  |          A |
|           2  |          B |
|           3  |          A |
|           4  |          C |
|           4  |          D |
|           4  |          E |
+--------------+------------+

I should have something like:

+-------+---+------+-------+
| count |   |      |       |
+-------+---+------+-------+
|     2 | A | B    | NULL  |
|     1 | A | NULL | NULL  |
|     1 | C | D    | E     |
+-------+---+------+-------+

Or, eventually, something like :

+-------+-------+
| count |       |
+-------+-------+
|     2 | A,B   |
|     1 | A     |
|     1 | C,D,E |
+-------+-------+

I can't find the SQL query to do this.


Solution

  • We need to compare sorted lists of reasons to identify equal sets.

    SELECT count(*) AS ct, reason_list
    FROM  (
       SELECT array_agg(reason_id) AS reason_list
       FROM  (SELECT * FROM activity_reason ORDER BY activity_id, reason_id) ar1
       GROUP  BY activity_id
       ) ar2
    GROUP  BY reason_list
    ORDER  BY ct DESC, reason_list;
    

    ORDER BY reason_id in the innermost subquery would work, too, but adding activity_id is typically faster.

    And we don't strictly need the innermost subquery at all. This works as well:

    SELECT count(*) AS ct, reason_list
    FROM  (
       SELECT array_agg(reason_id ORDER BY reason_id) AS reason_list
       FROM   activity_reason
       GROUP  BY activity_id
       ) ar2
    GROUP  BY reason_list
    ORDER  BY ct DESC, reason_list;
    

    But it's typically slower for processing all or most of the table. Quoting the manual:

    Alternatively, supplying the input values from a sorted subquery will usually work.

    We could use string_agg() instead of array_agg(), and that would work for your example with varchar(1) (which might be more efficient with data type "char", btw). It can fail for longer strings, though. The aggregated value can be ambiguous.


    If reason_id would be an integer (like it typically is), there is another, faster solution with sort() from the additional module intarray:

    SELECT count(*) AS ct, reason_list
    FROM  (
       SELECT sort(array_agg(reason_id)) AS reason_list
       FROM   activity_reason2
       GROUP  BY activity_id
       ) ar2
    GROUP  BY reason_list
    ORDER  BY ct DESC, reason_list;
    

    Related, with more explanation: