sql postgresql pattern-matching lookup-tables postgresql-performance

Efficient self-join on column value being the prefix of join column

How can I efficiently turn a set of strings into a table such that each strings is mapped to all other strings of which it is a prefix?

I use PostgreSQL 14 and deal with a table medical_codes of about 55,000 strings (medical codes of different classification systems such as ICD-10). Each is between 3 and 8 letters long. I'd now like to efficiently build a lookup table which maps each string a to all other strings b such that a is a prefix of b. E.g. given:

code
A04
A04.1
A04.2
A05
A06

I'd like to generate:

prefix	code
A04	A04
A04	A04.1
A04	A04.2
A04.1	A04.1
A04.2	A04.2
A05	A05
A06	A06

A naive implementation which works, but which is horribly inefficient, is

SELECT
    prefix_code.code AS prefix,
    specific_code.code AS code
FROM
    medical_codes AS prefix_code
    INNER JOIN medical_codes AS specific_code ON specific_code.code LIKE prefix_code.code || '%'

Alas, even a varchar_pattern_ops index on medical_codes(code) doesn't help here since the search pattern is constructed dynamically. I also tried taking the length of the codes into account:

CREATE INDEX medical_codes_code_length_idx ON medical_codes (length(code));

SELECT
    prefix_code.code AS prefix_id,
    specific_code.code AS code_id
FROM
    medical_codes AS prefix_code
    INNER JOIN medical_codes AS specific_code ON length(specific_code.code) >= length(prefix_code.code)
WHERE
    specific_code.code LIKE prefix_code.code || '%'

...which helps a bit, but it's still very slow.

Is there a more efficient way to do this? Maybe a query could be composed which performs a "divide and conquer" approach, starting with the shortest strings and then recursing for each of them to find longer matching strings?

Solution

I have a silver bullet for you, that isn't too easy to find:
Create an SP-GiST index and use the "starts-with" operator ^@

Index:

CREATE INDEX medical_codes_code_spgist_idx ON medical_codes USING spgist(code);

Query:

WITH prefix(code) AS (
   VALUES
     ('A04')
   , ('A04.1')
   , ('A04.2')
   , ('A05')
   , ('A06')
   )
SELECT p.code AS prefix, c.code
FROM   prefix p
LEFT   JOIN medical_codes c ON c.code ^@ p.code
ORDER  BY 1, 2;  -- optional

fiddle -- Postgres 16
fiddle -- Postgres 14

I consistently get times < 1 ms.

The operator ^@ is not documented in Postgres 14. But it's already there. It's documented (and improved) since Postgres 15. See:

PostgreSQL LIKE query performance variations

I had hoped it would work with a B-tree COLLATE "C" index in Postgres 16 as well. But for some reason I only get the great query plan with the SP-GiST index. Not sure why that is.