Search code examples
sqlpostgresqlpattern-matchinglookup-tablespostgresql-performance

Efficient self-join on column value being the prefix of join column


How can I efficiently turn a set of strings into a table such that each strings is mapped to all other strings of which it is a prefix?

I use PostgreSQL 14 and deal with a table medical_codes of about 55,000 strings (medical codes of different classification systems such as ICD-10). Each is between 3 and 8 letters long. I'd now like to efficiently build a lookup table which maps each string a to all other strings b such that a is a prefix of b. E.g. given:

code
A04
A04.1
A04.2
A05
A06

I'd like to generate:

prefix code
A04 A04
A04 A04.1
A04 A04.2
A04.1 A04.1
A04.2 A04.2
A05 A05
A06 A06

A naive implementation which works, but which is horribly inefficient, is

SELECT
    prefix_code.code AS prefix,
    specific_code.code AS code
FROM
    medical_codes AS prefix_code
    INNER JOIN medical_codes AS specific_code ON specific_code.code LIKE prefix_code.code || '%'

Alas, even a varchar_pattern_ops index on medical_codes(code) doesn't help here since the search pattern is constructed dynamically. I also tried taking the length of the codes into account:

CREATE INDEX medical_codes_code_length_idx ON medical_codes (length(code));

SELECT
    prefix_code.code AS prefix_id,
    specific_code.code AS code_id
FROM
    medical_codes AS prefix_code
    INNER JOIN medical_codes AS specific_code ON length(specific_code.code) >= length(prefix_code.code)
WHERE
    specific_code.code LIKE prefix_code.code || '%'

...which helps a bit, but it's still very slow.

Is there a more efficient way to do this? Maybe a query could be composed which performs a "divide and conquer" approach, starting with the shortest strings and then recursing for each of them to find longer matching strings?


Solution

  • I have a silver bullet for you, that isn't too easy to find:
    Create an SP-GiST index and use the "starts-with" operator ^@

    Index:

    CREATE INDEX medical_codes_code_spgist_idx ON medical_codes USING spgist(code);
    

    Query:

    WITH prefix(code) AS (
       VALUES
         ('A04')
       , ('A04.1')
       , ('A04.2')
       , ('A05')
       , ('A06')
       )
    SELECT p.code AS prefix, c.code
    FROM   prefix p
    LEFT   JOIN medical_codes c ON c.code ^@ p.code
    ORDER  BY 1, 2;  -- optional
    

    fiddle -- Postgres 16
    fiddle -- Postgres 14

    I consistently get times < 1 ms.

    The operator ^@ is not documented in Postgres 14. But it's already there. It's documented (and improved) since Postgres 15. See:

    I had hoped it would work with a B-tree COLLATE "C" index in Postgres 16 as well. But for some reason I only get the great query plan with the SP-GiST index. Not sure why that is.