How can I efficiently turn a set of strings into a table such that each strings is mapped to all other strings of which it is a prefix?
I use PostgreSQL 14 and deal with a table medical_codes
of about 55,000 strings (medical codes of different classification systems such as ICD-10). Each is between 3 and 8 letters long. I'd now like to efficiently build a lookup table which maps each string a
to all other strings b
such that a
is a prefix of b
. E.g. given:
code |
---|
A04 |
A04.1 |
A04.2 |
A05 |
A06 |
I'd like to generate:
prefix | code |
---|---|
A04 | A04 |
A04 | A04.1 |
A04 | A04.2 |
A04.1 | A04.1 |
A04.2 | A04.2 |
A05 | A05 |
A06 | A06 |
A naive implementation which works, but which is horribly inefficient, is
SELECT
prefix_code.code AS prefix,
specific_code.code AS code
FROM
medical_codes AS prefix_code
INNER JOIN medical_codes AS specific_code ON specific_code.code LIKE prefix_code.code || '%'
Alas, even a varchar_pattern_ops
index on medical_codes(code)
doesn't help here since the search pattern is constructed dynamically. I also tried taking the length of the codes into account:
CREATE INDEX medical_codes_code_length_idx ON medical_codes (length(code));
SELECT
prefix_code.code AS prefix_id,
specific_code.code AS code_id
FROM
medical_codes AS prefix_code
INNER JOIN medical_codes AS specific_code ON length(specific_code.code) >= length(prefix_code.code)
WHERE
specific_code.code LIKE prefix_code.code || '%'
...which helps a bit, but it's still very slow.
Is there a more efficient way to do this? Maybe a query could be composed which performs a "divide and conquer" approach, starting with the shortest strings and then recursing for each of them to find longer matching strings?
I have a silver bullet for you, that isn't too easy to find:
Create an SP-GiST index and use the "starts-with" operator ^@
Index:
CREATE INDEX medical_codes_code_spgist_idx ON medical_codes USING spgist(code);
Query:
WITH prefix(code) AS (
VALUES
('A04')
, ('A04.1')
, ('A04.2')
, ('A05')
, ('A06')
)
SELECT p.code AS prefix, c.code
FROM prefix p
LEFT JOIN medical_codes c ON c.code ^@ p.code
ORDER BY 1, 2; -- optional
fiddle -- Postgres 16
fiddle -- Postgres 14
I consistently get times < 1 ms.
The operator ^@
is not documented in Postgres 14. But it's already there. It's documented (and improved) since Postgres 15. See:
I had hoped it would work with a B-tree COLLATE "C"
index in Postgres 16 as well. But for some reason I only get the great query plan with the SP-GiST index. Not sure why that is.