Search code examples
sqlpostgresqldatetimesubquerygreatest-n-per-group

selecting DISTINCT rows that are ordered by something different


i have this table performances, where there are performances of theatre plays from other table:

id | play_id |    when    | other uninteresting meta data
1  |    3    | 2020-04-01 |             ...
2  |    4    | 2020-03-03 |             
3  |    3    | 2020-01-02 |             
4  |    1    | 2020-06-03 |             
5  |    4    | 2020-10-13 |             

and i would like to select the earliest performance for each distinct play (so each play is represented by a single, earliest, performance), ordered from soonest to latest.

so from the presented data i would like to get this:

id | play_id |    when    | other uninteresting meta data
3  |    3    | 2020-01-02 |             ...
2  |    4    | 2020-03-03 |             
4  |    1    | 2020-06-03 |             

so far, after studying some answers here and here i came up with this query


SELECT * FROM
(
  SELECT DISTINCT ON (play_id) *
  FROM performances
  WHERE performances.deleted_at is null
  ORDER BY performances.play_id ASC, performances.when ASC
) distinct_plays
order by distinct_plays.when ASC

however, i don't trust it at all, because in the linked threads there were people bickering and telling that each other's answers are wrong; and while i see some problems with the answers in that thread i don't see a problem in this answer yet.

is this a good query for my task? does it not select duplicate rows or is very ineffective?


Solution

  • Your query does what you want. distinct on is usually the right tool in Postgres to solve such greatest-n-per-group problem... alas, it gives no flexibility on the order of rows in the resultset.

    It appears that you want a different sort in the result as in distinct on - so you need another level of nesting for this. While your code does what you want, I will recommend using row_number() instead (which has also the advantage of being supported in many databases, as opposed to vendor-specific distinct on):

    SELECT *
    FROM (
        SELECT p.*, ROW_NUMBER() OVER(PARTITION BY play_id ORDER BY p.when asc) rn
        FROM performances p
        WHERE p.deleted_at is null
    ) p
    WHERE rn = 1
    ORDER BY p.when asc
    

    You might also want to try a correlated subquery:

    SELECT p.*
    FROM performances p
    WHERE p.deleted_at IS NULL AND p.when = (
        SELECT MIN(p1.when) FROM performances p1 WHERE p1.play_id = p.play_id
    )
    ORDER BY p.when
    

    For performance with the correlated subquery, consider an index on (play_id, when).