Tags: sql, postgresql, greatest-n-per-group, distinct-on

Select N latest rows per school, but skip duplicates from the same student


In a Postgres db (engine version 14.6), I have two tables: school_students and id_cards. They look like the following:

CREATE TABLE school_students (
    id BIGSERIAL PRIMARY KEY,
    school_id character varying NOT NULL,
    student_id character varying NOT NULL
);

-- Indices -------------------------------------------------------

-- school_students_pkey on (id) is created implicitly by the PRIMARY KEY
CREATE UNIQUE INDEX school_students_school_student_idx ON school_students(school_id text_ops, student_id text_ops);

CREATE TABLE id_cards (
    id BIGSERIAL PRIMARY KEY,
    school_student_id bigint NOT NULL REFERENCES school_students(id)
);

-- Indices -------------------------------------------------------

-- id_cards_pkey on (id) is created implicitly by the PRIMARY KEY
CREATE INDEX id_cards_school_student_id_idx ON id_cards(school_student_id int8_ops);

ID cards are periodically reissued to students, so there may be any number of cards per student. Each card is unique. Each student+school entity is unique. A student may belong to multiple schools.

There are currently 11 schools in the system, with the number of students in each ranging anywhere from ~200 to ~20,000. Most students have one id card, but they may be issued new cards at any time and there is no limit to how many id cards they may have.

Here is a fiddle with sample data.

In front of the DB is an API with a resource whose job is to take a list of school_ids and retrieve the most recently created id card for each school+student, limited to a set number of students per school_id. The API will be queried frequently enough that it must be performant.

I have the following query to get me what I need without any limiting:

SELECT id FROM id_cards
WHERE id IN (
  SELECT MAX(c.id) FROM id_cards c
  JOIN school_students ss ON c.school_student_id = ss.id
  WHERE
    school_id IN ('1', '2', '3') -- up to 100 ids per application code; in practice, there are only a dozen or so schools
  GROUP BY
    school_id,
    student_id
);

However, in my sub-query, I'd like to limit the number of school+student records returned to 100 for each school_id listed in the WHERE clause, in order to avoid runaway queries when a given school has a large number of students. For instance: the latest id card issued for up to 50 of the most recently added students for school 1, then the latest id card for up to 50 of the most recently added students for school 2, and so on for each school listed in the query.

Is there a performant query that will accomplish this?


Solution

  • The mutilated relational model makes more efficient query techniques impossible.

    The mentioned view is irrelevant for performance optimization. A view is basically just a stored query. It has neither data nor indexes of its own. A view is a convenience feature that doesn't help performance. (Not to be confused with a MATERIALIZED VIEW.)
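
    A MATERIALIZED VIEW, by contrast, stores its own rows and can be indexed. A minimal sketch of what that could look like here (the view and index names are my assumptions, and it needs periodic refreshes):

    CREATE MATERIALIZED VIEW school_cards AS
    SELECT s.school_id, s.student_id, c.id AS card_id
    FROM   id_cards c
    JOIN   school_students s ON s.id = c.school_student_id;

    -- matching index for "latest card per school + student"
    CREATE INDEX school_cards_idx ON school_cards (school_id, student_id, card_id DESC);

    -- must be refreshed to pick up newly issued cards
    REFRESH MATERIALIZED VIEW school_cards;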

    The main table id_cards only features school_student_id. There is no way to immediately deduce a school from this number. So there is also no way to add an index to speed up the query at hand. Indexes apply to one table in Postgres (and most other RDBMS). My initially suggested smart queries are useless without a matching index.
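
    The only other way to make such an index possible is to change the relational model itself, for instance with a redundant school_id column on id_cards that a multicolumn index can then cover. A rough sketch with assumed names; the redundant column has to be kept in sync by the application or a trigger:

    ALTER TABLE id_cards ADD COLUMN school_id character varying;

    -- backfill from school_students
    UPDATE id_cards c
    SET    school_id = s.school_id
    FROM   school_students s
    WHERE  s.id = c.school_student_id;

    ALTER TABLE id_cards ALTER COLUMN school_id SET NOT NULL;
    CREATE INDEX id_cards_school_id_id_idx ON id_cards (school_id, id DESC);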

    That said, for the moderate number of rows in your DB, even a "brute force" query with a (single!) sequential scan over each of the two tables (the only remaining option unless you add a MATERIALIZED VIEW or change the relational model) is not that expensive. This should be as good as it gets:

    SELECT card_id
    FROM  (
       SELECT card_id
            , row_number() OVER (PARTITION BY school_id ORDER BY card_id DESC ROWS UNBOUNDED PRECEDING) AS rn
       FROM  (
          SELECT DISTINCT ON (s.school_id, s.student_id)
                 s.school_id, c.id AS card_id
          FROM   id_cards c
          JOIN   school_students s ON s.id = c.school_student_id
          ORDER  BY s.school_id, s.student_id, c.id DESC
          ) sub1
       ) sub2
    WHERE  rn <= 2; -- your limit
    

    fiddle
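
    If, as in your question, only a given list of school_ids is requested, that filter can be pushed into the innermost subquery so fewer rows have to be read and sorted. A sketch with literal IDs (a bind parameter like school_id = ANY($1) works the same way):

    SELECT card_id
    FROM  (
       SELECT card_id
            , row_number() OVER (PARTITION BY school_id ORDER BY card_id DESC ROWS UNBOUNDED PRECEDING) AS rn
       FROM  (
          SELECT DISTINCT ON (s.school_id, s.student_id)
                 s.school_id, c.id AS card_id
          FROM   id_cards c
          JOIN   school_students s ON s.id = c.school_student_id
          WHERE  s.school_id IN ('1', '2', '3')  -- only the requested schools
          ORDER  BY s.school_id, s.student_id, c.id DESC
          ) sub1
       ) sub2
    WHERE  rn <= 100;  -- limit per school from your question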


    If we know that all result rows are among the N most recent entries, we could work with that and optimize the above query:

    SELECT card_id
    FROM  (
       SELECT card_id
            , row_number() OVER (PARTITION BY school_id ORDER BY card_id DESC ROWS UNBOUNDED PRECEDING) AS rn
       FROM  (
          SELECT DISTINCT ON (s.school_id, s.student_id)
                 s.*, c.id AS card_id
          FROM  (
             SELECT *
             FROM   id_cards
             ORDER  BY id DESC
             LIMIT  1000         -- this covers the latest 2 for all schools (!)
             ) c
          JOIN   school_students s ON s.id = c.school_student_id
          ORDER  BY s.school_id, s.student_id, c.id DESC
          ) sub1
       ) sub2
    WHERE  rn <= 2;

    Note that the (highly unrealistic!) data distribution in your fiddle adds all the latest rows for a single school, so this optimization would not apply!

    Also, the increased limit of 100 (up from 2) in your updated question makes this optimization a lot less attractive (or not applicable at all).

    Reading only the top N rows from an index (and sorting them) is proportionally faster than reading a couple hundred thousand rows. In this case (but not for the general query above), it would help to make the PK into a covering index and get index-only scans out of it, like this:

    ALTER TABLE id_cards
      DROP CONSTRAINT id_cards_pkey
    , ADD  CONSTRAINT id_cards_pkey PRIMARY KEY (id) INCLUDE (school_student_id);
    


    Then again, selecting the "latest" entries based on a serial ID is unreliable to begin with. Concurrent write load and other internals can mess with the sequence, and such IDs are only guaranteed to be unique, not necessarily sequential. An added timestamp is more reasonable. (Plus rules for how to break ties.)
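
    A minimal sketch of that approach (the column name and the tie-breaking rule are assumptions):

    ALTER TABLE id_cards ADD COLUMN created_at timestamptz NOT NULL DEFAULT now();

    -- then pick the "latest" card like this (instead of ordering by id alone):
    -- ORDER BY s.school_id, s.student_id, c.created_at DESC, c.id DESC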

    BTW, Postgres 16 (and to some extent already pg 15) has multiple improvements over Postgres 14 that help the performance of these queries. Compare:

    fiddle

    Among other things, ROWS UNBOUNDED PRECEDING doesn't help any more, because the optimization is built-in now. (Doesn't hurt, either.)

    Outdated answer for the initial question

    (While still assuming a table school_id_cards.)

    If your table is big, you want to avoid expensive whole-table sequential scans. Use a smart query to pick qualifying rows with index(-only) scans from a matching index. Hugely faster.

    Typically, there should exist some kind of "school" table in your DB with exactly one row per relevant school. That makes the query simpler and faster:

    WITH RECURSIVE latest_card AS (
       SELECT c.*
       FROM   school s
       CROSS  JOIN LATERAL (
          SELECT c.school_id, c.card_id, ARRAY[c.student_id] AS leading_ids
          FROM   school_id_cards c
          WHERE  c.school_id = s.school_id
          ORDER  BY c.card_id DESC
          LIMIT  1
          ) c
    
       UNION ALL
       SELECT c.*
       FROM   latest_card l
       JOIN   LATERAL (
          SELECT l.school_id, c.card_id, l.leading_ids || student_id
          FROM   school_id_cards c
          WHERE  c.school_id = l.school_id
          AND    c.card_id < l.card_id
          AND    c.student_id <> ALL (l.leading_ids)
          ORDER  BY c.card_id DESC
          LIMIT  1
          ) c ON cardinality(l.leading_ids) < 2  -- your limit per school here!
       )
    SELECT card_id
    FROM   latest_card
    ORDER  BY card_id;
    

    fiddle
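
    For reference, the assumed school table needs just one row per school. A minimal sketch (the data type is assumed to match school_id in school_id_cards):

    CREATE TABLE school (
       school_id character varying PRIMARY KEY
       -- , ... other attributes of a school
    );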

    This scales well for a small limit per school like you demonstrated. For a large limit, I would switch to a different query.


    Be sure to have a matching index like

    CREATE INDEX ON school_id_cards (school_id DESC, card_id DESC);
    

    A simpler index with (default) ascending sort order is barely any worse, since Postgres can scan B-tree indexes backwards. Only a mix of opposing sort orders would be less ideal.
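
    For comparison, that simpler variant of the index above would be:

    CREATE INDEX ON school_id_cards (school_id, card_id);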

    If there is no school table:

    WITH RECURSIVE latest_card AS (
       (
       SELECT DISTINCT ON (school_id)
              school_id, card_id, ARRAY[student_id] AS leading_ids
       FROM   school_id_cards c
       ORDER  BY school_id DESC, card_id DESC
       )
    
       UNION ALL
       SELECT c.*
       FROM   latest_card l
       JOIN   LATERAL (
          SELECT l.school_id, c.card_id, l.leading_ids || student_id
          FROM   school_id_cards c
          WHERE  c.school_id = l.school_id
          AND    c.card_id < l.card_id
          AND    c.student_id <> ALL (l.leading_ids)
          ORDER  BY c.card_id DESC
          LIMIT  1
          ) c ON cardinality(l.leading_ids) < 2  -- your limit per school here!
       )
    SELECT card_id
    FROM   latest_card
    ORDER  BY card_id;
    


    You could replace the non-recursive term with another, nested rCTE to generate the list of schools (possibly with the latest card to kick things off) ...
    But there really should be a school table. Create it if you don't have it.
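
    One way to create and fill it from existing data (a sketch, still assuming the old school_id_cards table):

    CREATE TABLE school AS
    SELECT DISTINCT school_id
    FROM   school_id_cards;

    ALTER TABLE school ADD PRIMARY KEY (school_id);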