Search code examples
sqlpostgresqlaggregate-functionscommon-table-expressionrelational-division

Find spectators that have seen the same shows (match multiple rows for each)


For an assignment I have to write several SQL queries for a database stored in a PostgreSQL server running PostgreSQL 9.3.0. However, I find myself blocked with last query. The database models a reservation system for an opera house. The query is about associating the a spectator the other spectators that assist to the same events every time.

The model looks like this:

Reservations table 
id_res |     create_date     |  tickets_presented  | id_show | id_spectator | price | category 
-------+---------------------+---------------------+---------+--------------+-------+----------
     1 | 2015-08-05 17:45:03 |                     |       1 |            1 |   195 |        1
     2 | 2014-03-15 14:51:08 | 2014-11-30 14:17:00 |      11 |            1 |   150 |        2

Spectators table

id_spectator   | last_name  | first_name |                email                   |     create_time     | age 
---------------+------------+------------+----------------------------------------+---------------------+-----   
             1 | gonzalez   | colin      | colin.gonzalez@gmail.com               | 2014-03-15 14:21:30 |  22
             2 | bequet     | camille    | bequet.camille@gmail.com               | 2014-12-10 15:22:31 |  22

Shows table
 id_show |          name          |  kind  | presentation_date | start_time | end_time | id_season | capacity_cat1 | capacity_cat2 | capacity_cat3 | price_cat1 | price_cat2 | price_cat3 
---------+------------------------+--------+-------------------+------------+----------+-----------+---------------+---------------+---------------+------------+------------+------------
       1 | madama butterfly       | opera  | 2015-09-05        | 19:30:00   | 21:30:00 |         2 |           315 |           630 |           945 |        195 |        150 |        100
       2 | don giovanni           | opera  | 2015-09-12        | 19:30:00   | 21:45:00 |         2 |           315 |           630 |           945 |        195 |        150 |        100

So far I've started by writing a query to get the id of the spectator and the date of the show he's attending to, the query looks like this.

SELECT Reservations.id_spectator, Shows.presentation_date
FROM Reservations
LEFT JOIN Shows ON Reservations.id_show = Shows.id_show;

Could someone help me understand better the problem and hint me towards finding a solution. Thanks in advance.

So the result I'm expecting should be something like this

id_spectator | other_id_spectators
-------------+--------------------
            1|                 2,3

Meaning that every time spectator with id 1 went to a show, spectators 2 and 3 did too.


Solution

  • Meaning that every time spectator with id 1 went to a show, spectators 2 and 3 did too.

    In other words, you want a list of ...
    all spectators that have seen all the shows that a given spectator has seen (and possibly more than the given one)

    This is a special case of relational division. We have assembled an arsenal of basic techniques here:

    It is special because the list of shows each spectator has to have attended is dynamically determined by the given prime spectator.

    Assuming that (d_spectator, id_show) is unique in reservations, which has not been clarified.

    A UNIQUE constraint on those two columns (in that order) also provides the most important index.
    For best performance in query 2 and 3 below also create an index with leading id_show.

    1. Brute force

    The primitive approach would be to form a sorted array of shows the given user has seen and compare the same array of others:

    SELECT 1 AS id_spectator, array_agg(sub.id_spectator) AS id_other_spectators
    FROM  (
       SELECT id_spectator
       FROM   reservations r
       WHERE  id_spectator <> 1
       GROUP  BY 1
       HAVING        array_agg(id_show ORDER BY id_show)
          @> (SELECT array_agg(id_show ORDER BY id_show)
              FROM   reservations
              WHERE  id_spectator = 1)
       ) sub;
    

    But this is potentially very expensive for big tables. The whole table hast to be processes, and in a rather expensive way, too.

    2. Smarter

    Use a CTE to determine relevant shows, then only consider those

    WITH shows AS (             -- all shows of id 1; 1 row per show
       SELECT id_spectator, id_show
       FROM   reservations
       WHERE  id_spectator = 1  -- your prime spectator here
       )
    SELECT sub.id_spectator, array_agg(sub.other) AS id_other_spectators
    FROM  (
       SELECT s.id_spectator, r.id_spectator AS other
       FROM   shows s
       JOIN   reservations r USING (id_show)
       WHERE  r.id_spectator <> s.id_spectator
       GROUP  BY 1,2
       HAVING count(*) = (SELECT count(*) FROM shows)
       ) sub
    GROUP  BY 1;
    

    @> is the "contains2 operator for arrays - so we get all spectators that have at least seen the same shows.

    Faster than 1. because only relevant shows are considered.

    3. Real smart

    To also exclude spectators that are not going to qualify early from the query, use a recursive CTE:

    WITH RECURSIVE shows AS (   -- produces exactly 1 row
       SELECT id_spectator, array_agg(id_show) AS shows, count(*) AS ct
       FROM   reservations
       WHERE  id_spectator = 1  -- your prime spectator here
       GROUP  BY 1
       )
    , cte AS (
       SELECT r.id_spectator, 1 AS idx
       FROM   shows s
       JOIN   reservations r ON r.id_show = s.shows[1]
       WHERE  r.id_spectator <> s.id_spectator
    
       UNION  ALL
       SELECT r.id_spectator, idx + 1
       FROM   cte c
       JOIN   reservations r USING (id_spectator)
       JOIN   shows s ON s.shows[c.idx + 1] = r.id_show
       )
    SELECT s.id_spectator, array_agg(c.id_spectator) AS id_other_spectators
    FROM   shows s
    JOIN   cte c ON c.idx = s.ct  -- has an entry for every show
    GROUP  BY 1;
    

    Note that the first CTE is non-recursive. Only the second part is recursive (iterative really).

    This should be fastest for small selections from big tables. Row that don't qualify are excluded early. the two indices I mentioned are essential.

    SQL Fiddle demonstrating all three.