sql postgresql aggregate-functions common-table-expression relational-division

Find spectators that have seen the same shows (match multiple rows for each)

For an assignment I have to write several SQL queries for a database stored in a PostgreSQL server running PostgreSQL 9.3.0. However, I find myself blocked with last query. The database models a reservation system for an opera house. The query is about associating the a spectator the other spectators that assist to the same events every time.

The model looks like this:

Reservations table 
id_res |     create_date     |  tickets_presented  | id_show | id_spectator | price | category 
-------+---------------------+---------------------+---------+--------------+-------+----------
     1 | 2015-08-05 17:45:03 |                     |       1 |            1 |   195 |        1
     2 | 2014-03-15 14:51:08 | 2014-11-30 14:17:00 |      11 |            1 |   150 |        2

Spectators table

id_spectator   | last_name  | first_name |                email                   |     create_time     | age 
---------------+------------+------------+----------------------------------------+---------------------+-----   
             1 | gonzalez   | colin      | colin.gonzalez@gmail.com               | 2014-03-15 14:21:30 |  22
             2 | bequet     | camille    | bequet.camille@gmail.com               | 2014-12-10 15:22:31 |  22

Shows table
 id_show |          name          |  kind  | presentation_date | start_time | end_time | id_season | capacity_cat1 | capacity_cat2 | capacity_cat3 | price_cat1 | price_cat2 | price_cat3 
---------+------------------------+--------+-------------------+------------+----------+-----------+---------------+---------------+---------------+------------+------------+------------
       1 | madama butterfly       | opera  | 2015-09-05        | 19:30:00   | 21:30:00 |         2 |           315 |           630 |           945 |        195 |        150 |        100
       2 | don giovanni           | opera  | 2015-09-12        | 19:30:00   | 21:45:00 |         2 |           315 |           630 |           945 |        195 |        150 |        100

So far I've started by writing a query to get the id of the spectator and the date of the show he's attending to, the query looks like this.

SELECT Reservations.id_spectator, Shows.presentation_date
FROM Reservations
LEFT JOIN Shows ON Reservations.id_show = Shows.id_show;

Could someone help me understand better the problem and hint me towards finding a solution. Thanks in advance.

So the result I'm expecting should be something like this

id_spectator | other_id_spectators
-------------+--------------------
            1|                 2,3

Meaning that every time spectator with id 1 went to a show, spectators 2 and 3 did too.

Solution

Meaning that every time spectator with id 1 went to a show, spectators 2 and 3 did too.

In other words, you want a list of ...
all spectators that have seen all the shows that a given spectator has seen (and possibly more than the given one)

This is a special case of relational division. We have assembled an arsenal of basic techniques here:

How to filter SQL results in a has-many-through relation

It is special because the list of shows each spectator has to have attended is dynamically determined by the given prime spectator.

Assuming that (d_spectator, id_show) is unique in reservations, which has not been clarified.

A UNIQUE constraint on those two columns (in that order) also provides the most important index.
For best performance in query 2 and 3 below also create an index with leading id_show.

1. Brute force

The primitive approach would be to form a sorted array of shows the given user has seen and compare the same array of others:

SELECT 1 AS id_spectator, array_agg(sub.id_spectator) AS id_other_spectators
FROM  (
   SELECT id_spectator
   FROM   reservations r
   WHERE  id_spectator <> 1
   GROUP  BY 1
   HAVING        array_agg(id_show ORDER BY id_show)
      @> (SELECT array_agg(id_show ORDER BY id_show)
          FROM   reservations
          WHERE  id_spectator = 1)
   ) sub;

But this is potentially very expensive for big tables. The whole table hast to be processes, and in a rather expensive way, too.

2. Smarter

Use a CTE to determine relevant shows, then only consider those

WITH shows AS (             -- all shows of id 1; 1 row per show
   SELECT id_spectator, id_show
   FROM   reservations
   WHERE  id_spectator = 1  -- your prime spectator here
   )
SELECT sub.id_spectator, array_agg(sub.other) AS id_other_spectators
FROM  (
   SELECT s.id_spectator, r.id_spectator AS other
   FROM   shows s
   JOIN   reservations r USING (id_show)
   WHERE  r.id_spectator <> s.id_spectator
   GROUP  BY 1,2
   HAVING count(*) = (SELECT count(*) FROM shows)
   ) sub
GROUP  BY 1;

@> is the "contains2 operator for arrays - so we get all spectators that have at least seen the same shows.

Faster than 1. because only relevant shows are considered.

3. Real smart

To also exclude spectators that are not going to qualify early from the query, use a recursive CTE:

WITH RECURSIVE shows AS (   -- produces exactly 1 row
   SELECT id_spectator, array_agg(id_show) AS shows, count(*) AS ct
   FROM   reservations
   WHERE  id_spectator = 1  -- your prime spectator here
   GROUP  BY 1
   )
, cte AS (
   SELECT r.id_spectator, 1 AS idx
   FROM   shows s
   JOIN   reservations r ON r.id_show = s.shows[1]
   WHERE  r.id_spectator <> s.id_spectator

   UNION  ALL
   SELECT r.id_spectator, idx + 1
   FROM   cte c
   JOIN   reservations r USING (id_spectator)
   JOIN   shows s ON s.shows[c.idx + 1] = r.id_show
   )
SELECT s.id_spectator, array_agg(c.id_spectator) AS id_other_spectators
FROM   shows s
JOIN   cte c ON c.idx = s.ct  -- has an entry for every show
GROUP  BY 1;

Note that the first CTE is non-recursive. Only the second part is recursive (iterative really).

This should be fastest for small selections from big tables. Row that don't qualify are excluded early. the two indices I mentioned are essential.

SQL Fiddle demonstrating all three.