In a Postgres db (engine version 14.6), I have two tables: school_students and id_cards. They look like the following:
CREATE TABLE school_students (
  id BIGSERIAL PRIMARY KEY,
  school_id character varying NOT NULL,
  student_id character varying NOT NULL
);
-- Indices -------------------------------------------------------
CREATE UNIQUE INDEX school_students_pkey ON school_students(id int8_ops);
CREATE UNIQUE INDEX school_students_school_student_idx ON school_students(school_id text_ops, student_id text_ops);
CREATE TABLE id_cards (
  id BIGSERIAL PRIMARY KEY,
  school_student_id bigint NOT NULL REFERENCES school_students(id)
);
-- Indices -------------------------------------------------------
CREATE UNIQUE INDEX id_cards_pkey ON id_cards(id int8_ops);
CREATE INDEX id_cards_school_student_id_idx ON id_cards(school_student_id int8_ops);
ID cards are periodically reissued to students, so there may be any number of cards per student. Each card is unique. Each student+school entity is unique. A student may belong to multiple schools.
There are currently 11 schools in the system, with the number of students per school ranging from ~200 to ~20,000. Most students have one id card, but they may be issued new cards at any time and there is no limit to how many id cards they may have.
Here is a fiddle with sample data.
In front of the DB is an API with a resource whose job is to take a list of school_ids and retrieve the most recently created id card for each school+student, limited to a select number of students per school_id. The API is queried frequently enough that it must be performant.
I have the following query to get me what I need without any limiting:
SELECT id FROM id_cards
WHERE id IN (
  SELECT MAX(c.id) FROM id_cards c
  JOIN school_students ss ON c.school_student_id = ss.id
  WHERE
    ss.school_id IN ('1', '2', '3') -- up to 100 ids per application code; in practice, there are only a dozen or so schools
  GROUP BY
    ss.school_id,
    ss.student_id
);
However, in my sub-query, I'd like to limit the number of school+student records returned to 100 for each school_id listed in the WHERE clause, in order to avoid runaway queries where a given school has a large number of students. For instance: the latest id card issued for up to 50 of the most recently added students for school 1, the latest id card for up to 50 of the most recently added students for school 2, and so on for each school listed in the query.
Is there a performant query that will accomplish this?
The mutilated relational model makes more efficient query techniques impossible.
The mentioned view is irrelevant for performance optimization. A view is basically just a stored query. It has no data nor indices of its own. A view is a convenience feature that doesn't help performance. (Not to be confused with a MATERIALIZED VIEW.)
The main table id_cards only features school_student_id. There is no way to immediately deduce a school from this number, so there is also no way to add an index to speed up the query at hand. An index applies to a single table in Postgres (and most other RDBMS). My initially suggested smart queries are useless without a matching index.
That said, for the moderately few rows in your DB, even a "brute force" query with a (single!) sequential scan on both tables (the only remaining option unless you add a MATERIALIZED VIEW or change the relational model) is not that expensive. This should be as good as it gets:
SELECT card_id
FROM  (
   SELECT card_id
        , row_number() OVER (PARTITION BY school_id ORDER BY card_id DESC ROWS UNBOUNDED PRECEDING) AS rn
   FROM  (
      SELECT DISTINCT ON (s.school_id, s.student_id)
             s.school_id, c.id AS card_id
      FROM   id_cards c
      JOIN   school_students s ON s.id = c.school_student_id
      ORDER  BY s.school_id, s.student_id, c.id DESC
      ) sub1
   ) sub2
WHERE  rn <= 2;  -- your limit
Related:
- About DISTINCT ON
- About the (optional) optimization with ROWS UNBOUNDED PRECEDING
If we know that all result rows are among the N most recent entries, we could work with that, and optimize above query:
SELECT card_id
FROM  (
   SELECT card_id
        , row_number() OVER (PARTITION BY school_id ORDER BY card_id DESC ROWS UNBOUNDED PRECEDING) AS rn
   FROM  (
      SELECT DISTINCT ON (s.school_id, s.student_id)
             s.*, c.id AS card_id
      FROM  (
         SELECT *
         FROM   id_cards
         ORDER  BY id DESC
         LIMIT  1000  -- this covers the latest 2 for all schools (!)
         ) c
      JOIN   school_students s ON s.id = c.school_student_id
      ORDER  BY s.school_id, s.student_id, c.id DESC
      ) sub1
   ) sub2
WHERE  rn <= 2;
Note that the (highly unrealistic!) data distribution in your fiddle adds all the latest rows for a single school, so this optimization would not apply!
Also, the increased limit of 100 (up from 2) in your updated question makes this optimization a lot less juicy (or not applicable at all).
Reading only the top N rows from an index (and sorting) is proportionally faster than reading a couple hundred thousand rows. In this case (but not for the general query above), it helps to make the PK a covering index and get index-only scans out of it, like this:
ALTER TABLE id_cards
  DROP CONSTRAINT id_cards_pkey
, ADD  CONSTRAINT id_cards_pkey PRIMARY KEY (id) INCLUDE (school_student_id);
See:
Then again, selecting the "latest" entries based on a serial ID is unreliable to begin with. Concurrent write load and other internals can mess with the sequence. Such IDs are only guaranteed to be unique, not necessarily sequential. An added timestamp is more reasonable. (Plus rules how to break ties.)
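For illustration, such a timestamp could be added along these lines (the column name created_at is my choice, not from your schema):

```sql
-- Hypothetical: record creation time to order cards reliably
ALTER TABLE id_cards
  ADD COLUMN created_at timestamptz NOT NULL DEFAULT now();
```

Ordering then becomes ORDER BY created_at DESC, id DESC, with the unique id breaking ties between identical timestamps.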
BTW, Postgres 16 (and to some extent already pg 15) has multiple improvements over Postgres 14 that help the performance of these queries. Compare:
Among other things, ROWS UNBOUNDED PRECEDING doesn't help any more, because the optimization is built in now. (It doesn't hurt, either.)
(While still assuming a table school_id_cards.)
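The queries below reference the columns school_id, student_id, and card_id of that table. One possible shape, purely as an assumption inferred from those queries:

```sql
-- Assumed definition of the combined table (not from the question)
CREATE TABLE school_id_cards (
  card_id    bigint PRIMARY KEY
, school_id  text   NOT NULL
, student_id text   NOT NULL
);
```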
If your table is big you want to avoid expensive whole-table sequential scans. Use a smart query to pick qualifying rows with index(-only) scans from a matching index. Hugely faster.
Typically, there should exist some kind of "school" table in your DB with exactly one row per relevant school. Makes the query simpler and faster:
WITH RECURSIVE latest_card AS (
   SELECT c.*
   FROM   school s
   CROSS  JOIN LATERAL (
      SELECT c.school_id, c.card_id, ARRAY[c.student_id] AS leading_ids
      FROM   school_id_cards c
      WHERE  c.school_id = s.school_id
      ORDER  BY c.card_id DESC
      LIMIT  1
      ) c
   UNION ALL
   SELECT c.*
   FROM   latest_card l
   JOIN   LATERAL (
      SELECT l.school_id, c.card_id, l.leading_ids || c.student_id
      FROM   school_id_cards c
      WHERE  c.school_id = l.school_id
      AND    c.card_id < l.card_id
      AND    c.student_id <> ALL (l.leading_ids)
      ORDER  BY c.card_id DESC
      LIMIT  1
      ) c ON cardinality(l.leading_ids) < 2  -- your limit per school here!
   )
SELECT card_id
FROM   latest_card
ORDER  BY card_id;
This scales well for a small limit per school like you demonstrated. For a large limit, I would switch to a different query.
About the use of a recursive CTE (rCTE):
Be sure to have a matching index like:
CREATE INDEX ON school_id_cards (school_id DESC, card_id DESC);
A simpler index with (default) ascending sort order is barely any worse: Postgres can scan B-tree indices backwards. Only an index with opposing sort orders would be less ideal.
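That is, assuming the same table, the plain ascending variant would serve almost as well:

```sql
CREATE INDEX ON school_id_cards (school_id, card_id);
```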
If there is no school table:
WITH RECURSIVE latest_card AS (
   (
   SELECT DISTINCT ON (school_id)
          school_id, card_id, ARRAY[student_id] AS leading_ids
   FROM   school_id_cards c
   ORDER  BY school_id DESC, card_id DESC
   )
   UNION ALL
   SELECT c.*
   FROM   latest_card l
   JOIN   LATERAL (
      SELECT l.school_id, c.card_id, l.leading_ids || c.student_id
      FROM   school_id_cards c
      WHERE  c.school_id = l.school_id
      AND    c.card_id < l.card_id
      AND    c.student_id <> ALL (l.leading_ids)
      ORDER  BY c.card_id DESC
      LIMIT  1
      ) c ON cardinality(l.leading_ids) < 2  -- your limit per school here!
   )
SELECT card_id
FROM   latest_card
ORDER  BY card_id;
About DISTINCT ON:
You could replace the non-recursive term with another, nested rCTE to generate the list of schools (possibly with the latest card to kick things off) ...
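A sketch of such a nested rCTE, emulating a loose index scan over the assumed school_id_cards table to collect the distinct school_ids:

```sql
-- Each step fetches the smallest school_id greater than the previous one,
-- so the index on (school_id, card_id) is probed once per school.
WITH RECURSIVE schools AS (
   SELECT min(school_id) AS school_id
   FROM   school_id_cards
   UNION ALL
   SELECT (SELECT min(school_id) FROM school_id_cards WHERE school_id > s.school_id)
   FROM   schools s
   WHERE  s.school_id IS NOT NULL
   )
SELECT school_id
FROM   schools
WHERE  school_id IS NOT NULL;
```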
But there really should be a school table. Create it if you don't have it.