sql postgresql datetime subquery greatest-n-per-group

selecting DISTINCT rows that are ordered by something different

i have this table performances, where there are performances of theatre plays from other table:

id | play_id |    when    | other uninteresting meta data
1  |    3    | 2020-04-01 |             ...
2  |    4    | 2020-03-03 |             
3  |    3    | 2020-01-02 |             
4  |    1    | 2020-06-03 |             
5  |    4    | 2020-10-13 |

and i would like to select the earliest performance for each distinct play (so each play is represented by a single, earliest, performance), ordered from soonest to latest.

so from the presented data i would like to get this:

id | play_id |    when    | other uninteresting meta data
3  |    3    | 2020-01-02 |             ...
2  |    4    | 2020-03-03 |             
4  |    1    | 2020-06-03 |

so far, after studying some answers here and here i came up with this query


SELECT * FROM
(
  SELECT DISTINCT ON (play_id) *
  FROM performances
  WHERE performances.deleted_at is null
  ORDER BY performances.play_id ASC, performances.when ASC
) distinct_plays
order by distinct_plays.when ASC

however, i don't trust it at all, because in the linked threads there were people bickering and telling that each other's answers are wrong; and while i see some problems with the answers in that thread i don't see a problem in this answer yet.

is this a good query for my task? does it not select duplicate rows or is very ineffective?

Solution

Your query does what you want. distinct on is usually the right tool in Postgres to solve such greatest-n-per-group problem... alas, it gives no flexibility on the order of rows in the resultset.

It appears that you want a different sort in the result as in distinct on - so you need another level of nesting for this. While your code does what you want, I will recommend using row_number() instead (which has also the advantage of being supported in many databases, as opposed to vendor-specific distinct on):

SELECT *
FROM (
    SELECT p.*, ROW_NUMBER() OVER(PARTITION BY play_id ORDER BY p.when asc) rn
    FROM performances p
    WHERE p.deleted_at is null
) p
WHERE rn = 1
ORDER BY p.when asc

You might also want to try a correlated subquery:

SELECT p.*
FROM performances p
WHERE p.deleted_at IS NULL AND p.when = (
    SELECT MIN(p1.when) FROM performances p1 WHERE p1.play_id = p.play_id
)
ORDER BY p.when

For performance with the correlated subquery, consider an index on (play_id, when).