Tags: sql, postgresql, greatest-n-per-group

How to select collection of records, but only one per subgroup, in SQL?


I have a system basically like this:

CREATE TABLE entries (
  id SERIAL PRIMARY KEY,
  slug VARCHAR(50) NOT NULL
);

CREATE TABLE pronunciations (
  id SERIAL PRIMARY KEY,
  entry_id INTEGER NOT NULL,
  position INTEGER NOT NULL,
  text VARCHAR(50) NOT NULL,
  FOREIGN KEY (entry_id) REFERENCES entries (id) ON DELETE CASCADE
);

Entries can have many pronunciations, and they are ordered per entry by position.

How do I select 100 pronunciations (for 100 entries) when each entry may have, say, 1-5 pronunciations? I basically want one pronunciation per entry in the final result, and I only want the result to contain pronunciations. I want to pick the one with the lowest position, presumably position == 0. I will then, at the application layer, assign each one to its entry record (which has other columns I left out, since they are beside the point).

I found examples that essentially say to do something with WITH, along these lines:

WITH ranked_pronunciations AS (
    SELECT 
        p.id,
        p.entry_id,
        p.position,
        p.text,
        ROW_NUMBER() OVER (PARTITION BY p.entry_id ORDER BY p.position) AS rn
    FROM 
        pronunciations p
)

SELECT 
    rp.id,
    rp.entry_id,
    rp.position,
    rp.text
FROM 
    ranked_pronunciations rp
WHERE 
    rp.rn = 1
ORDER BY 
    rp.entry_id
LIMIT 100;

That seems awfully complex. Is there a simpler way to do this? I guess I could fetch all pronunciations and filter down to the first one per entry at the application layer, but that would waste a lot of data transfer.

I am using Kysely, a PostgreSQL Node.js query builder, so I'm not quite sure how to express such a complex WITH statement there. I'm wondering (a) whether that's the required approach (in which case I'll figure out the Kysely syntax), or (b) whether there's a simpler one.


Solution

  • You are right, there is a much simpler way with DISTINCT ON:

    SELECT DISTINCT ON (entry_id) *
    FROM   pronunciations
    ORDER  BY entry_id, position  -- !
    LIMIT  100;
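
    Since the question mentions Kysely: the query above maps to Kysely fairly directly, because Kysely exposes PostgreSQL's DISTINCT ON via distinctOn() on select queries. A sketch, assuming a Database interface typed to match the schema in the question (the interface and db setup here are illustrative, not from the original post):

        import { Kysely } from 'kysely'

        // Illustrative table types matching the schema above
        interface Database {
          pronunciations: {
            id: number
            entry_id: number
            position: number
            text: string
          }
        }

        declare const db: Kysely<Database>

        // SELECT DISTINCT ON (entry_id) * FROM pronunciations
        // ORDER BY entry_id, position LIMIT 100
        const rows = await db
          .selectFrom('pronunciations')
          .distinctOn('entry_id')
          .selectAll()
          .orderBy('entry_id')
          .orderBy('position')
          .limit(100)
          .execute()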
    

    Typically faster, too, for your data distribution ("1-5 pronunciations per entry").
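
    If this query runs often, a multicolumn index matching the DISTINCT ON and ORDER BY columns lets Postgres walk the rows in the required order instead of sorting. A suggestion, not part of the original answer:

        -- Supports DISTINCT ON (entry_id) ... ORDER BY entry_id, position
        CREATE INDEX pronunciations_entry_position_idx
            ON pronunciations (entry_id, position);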

    As in your original query, the table entries is not needed. But this:

    I will then (at the application layer), assign them to the entry record [...]

    ... sounds like it would be a lot faster integrated in a single query.
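
    For instance, a LEFT JOIN LATERAL picks the first pronunciation per entry inside one round trip, so the application layer receives entries and pronunciations already paired. A sketch (column list assumed from the schema in the question):

        -- Each entry with its lowest-position pronunciation (NULL if none)
        SELECT e.id, e.slug, p.text AS pronunciation
        FROM   entries e
        LEFT   JOIN LATERAL (
           SELECT text
           FROM   pronunciations
           WHERE  entry_id = e.id
           ORDER  BY position
           LIMIT  1
           ) p ON true
        ORDER  BY e.id
        LIMIT  100;

    Unlike the DISTINCT ON variant, this also returns entries that have no pronunciations at all, which may or may not be what you want.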