Tags: sql, postgresql, algorithm, aggregation

Find the longest prefix for a table of tokenized strings


I have a table (n < 100) with '#'-separated values in Postgres 13+. Sample:

id  value
1   'ABC#DEF#123#BAIT'
2   'ABC#DEF#156#BAIT'
3   'ABC#DEF#178#BAIT'
...

These values can be of any length (e.g. 'ABC#DEFDEF#GHIG#123123#456#678').
There are no NULL values.

I need the longest common prefix across all rows: entire tokens only (no partial match within a token), matching from the start of the string.

Result for the above sample would be:

id   output
1    'ABC#DEF'  -- or 'ABC#DEF#'

Another example:

id  value
1   'ABC#123'
2   'ADE#456'

Should return:

id   output
1    ''

Since the tokens 'ABC' and 'ADE' do not match completely, the query should not return 'A'.

Currently I'm using:

CREATE OR REPLACE FUNCTION lcp_iterate(_state TEXT, value TEXT)
RETURNS TEXT
AS
$$
    SELECT  SUBSTRING($2, 1, s - 1)  -- cut just before the first differing character
    FROM    generate_series(1, LEAST(LENGTH($1), LENGTH($2))) s
    WHERE   SUBSTRING($1, 1, s) <> SUBSTRING($2, 1, s)
    UNION ALL
    SELECT  LEAST($1, $2)  -- no mismatch: keep the shorter string (also covers the NULL initial state)
    LIMIT 1;
$$
LANGUAGE sql;

DO $$ BEGIN
    CREATE AGGREGATE lcp(TEXT) (SFUNC = lcp_iterate, STYPE = TEXT);
EXCEPTION
    WHEN SQLSTATE '42723' THEN NULL;  -- duplicate_function: aggregate already exists
END $$;

But for the second example it returns:

row  output
1    'A'

It would be extremely simple in any programming language, but I'm struggling to find a way to do it in Postgres.

PS: I'm trying to keep my code base as readable as possible, so I'd like to avoid nesting 20 CASE expressions (values can have at most 20 '#' separators).
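For comparison, here is a sketch of a token-aware variant of the same aggregate idea, comparing arrays of tokens instead of characters. The names lcp_token_iterate and lcp_tokens are made up for illustration:

-- sketch: keep the longest leading run of tokens shared by state and value
CREATE OR REPLACE FUNCTION lcp_token_iterate(_state TEXT, _value TEXT)
RETURNS TEXT
AS
$$
    SELECT array_to_string(a[1:k], '#')
    FROM   string_to_array($1, '#') AS t1(a)
         , string_to_array($2, '#') AS t2(b)
         , LATERAL (
        SELECT COALESCE(max(s), 0) AS k  -- longest matching token prefix
        FROM   generate_series(1, LEAST(cardinality(a), cardinality(b))) s
        WHERE  a[1:s] = b[1:s]
        ) sub;
$$
LANGUAGE sql STRICT;  -- STRICT: the first row's value becomes the initial state

CREATE AGGREGATE lcp_tokens(TEXT) (SFUNC = lcp_token_iterate, STYPE = TEXT);

For the first sample, SELECT lcp_tokens(value) FROM tbl; would return 'ABC#DEF'.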

Basic query to try:

WITH source AS (
  SELECT 'ABC#DEF#123' AS value
  UNION ALL 
  SELECT 'ABC#DEF#456'
)

SELECT lcp(value)
FROM source;

Solution

  • Assuming id is the PRIMARY KEY and value is defined NOT NULL; otherwise you need to decide how to deal with NULL values (and do more).

    RECURSIVE CTE

    WITH RECURSIVE cte AS (
       (
       SELECT id, string_to_array(value, '#') AS arr, 0 AS idx
       FROM   tbl
       ORDER  BY id
       LIMIT  1
       )
    
       UNION ALL
       SELECT c.id, c.arr, c.idx + 1
       FROM   cte c
       WHERE  c.idx < cardinality(c.arr)  -- end of string
       AND    NOT EXISTS (                -- mismatch
          SELECT FROM tbl t
          WHERE  t.id > c.id
          AND    t.value !~ ('^' || array_to_string(c.arr[1:idx+1], '#') || '(#|$)')  -- trailing '#' or end
          )
       )
    SELECT array_to_string(arr[1:idx], '#') AS result
    FROM   cte
    ORDER  BY idx DESC
    LIMIT  1;
    

    fiddle

    Pick the first row.
    Then, for each successive token prefix of its value, check whether any other row disagrees. Keep looping until the first mismatch or the end of the string.
    The last iteration yields the longest common prefix. (For a single row it's the whole value.)
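    To try it, here is a throwaway setup with the sample data from the question (the table name tbl matches the query above):

    CREATE TEMP TABLE tbl (id int PRIMARY KEY, value text NOT NULL);
    INSERT INTO tbl VALUES
      (1, 'ABC#DEF#123#BAIT')
    , (2, 'ABC#DEF#156#BAIT')
    , (3, 'ABC#DEF#178#BAIT');

    -- the rCTE advances idx = 0, 1, 2 and stops at the first mismatching token;
    -- the final SELECT then returns 'ABC#DEF'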

    I formulated the predicate with a left-anchored, negated regex match (!~) to employ an index (!!!), which is essential to make this scale. There are various candidates: a trigram GIN index, or a plain B-tree index with COLLATE "C".

    Or a more specialized (properly customized!) text-search index.

    In recent versions an SP-GiST index should work very efficiently too.
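    For illustration, the candidate indexes could be created like so (index names are made up; assuming the table from the question, pick the one that fits):

    -- trigram GIN index (requires the pg_trgm extension)
    CREATE EXTENSION IF NOT EXISTS pg_trgm;
    CREATE INDEX tbl_value_trgm_idx ON tbl USING gin (value gin_trgm_ops);

    -- or: plain B-tree index with "C" collation
    CREATE INDEX tbl_value_c_idx ON tbl (value COLLATE "C");

    -- or: SP-GiST index on text
    CREATE INDEX tbl_value_spgist_idx ON tbl USING spgist (value);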

    The beauty:

    1. Can use an index, so the size of the table matters little. I.e., it scales excellently.
    2. Stops as soon as the first mismatch is found and does not look at the rest, however long that may be.
    3. Works for all corner cases.

    The beast:

    1. Added complexity from the rCTE.
    2. The overhead won't pay off for small tables with short strings. Test to find your sweet spot.

    string_to_table() and count

    SELECT string_agg(token, '#' ORDER BY idx) AS result
    FROM  (
       SELECT token, idx
            , row_number() OVER (ORDER BY idx) AS rn
       FROM   tbl, string_to_table(value, '#') WITH ORDINALITY AS a(token, idx)
       GROUP  BY token, idx
       HAVING count(token) = (SELECT count(*) FROM tbl)  -- token shared by every row
       ORDER  BY idx
       ) sub
    WHERE  idx = rn;  -- exclude positions past the first gap
    

    fiddle

    This uses the same basic approach as JNevill's query. Fixes a corner case error and should be faster. Note that string_to_table() requires Postgres 14 or later; in Postgres 13, unnest(string_to_array(value, '#')) WITH ORDINALITY works the same way.

    Unnest tokens into separate rows, remembering their position (idx).

    Aggregate tokens per position and only keep those that reach the total row count. At the same time, take note of the row_number() of each qualifying position (rn).
    Concatenate tokens after excluding positions past the first gap (idx > rn; see the demo below and in the fiddle) to arrive at the longest prefix.
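    To see why that filter is needed, here is a minimal demo of the gap case, with made-up sample values where position 3 agrees in all rows but position 2 does not:

    WITH tbl (id, value) AS (
       VALUES (1, 'A#X#C')
            , (2, 'A#Y#C')
       )
    SELECT token, idx
         , row_number() OVER (ORDER BY idx) AS rn
    FROM   tbl, string_to_table(value, '#') WITH ORDINALITY AS a(token, idx)
    GROUP  BY token, idx
    HAVING count(token) = (SELECT count(*) FROM tbl);

    -- returns ('A', 1, 1) and ('C', 3, 2):
    -- 'C' has idx = 3 but rn = 2, so the filter idx = rn drops it
    -- and the longest common prefix is just 'A'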

    Should be faster than the first query for small tables and short strings. But it scales with the number of rows times the number of token positions per string, i.e. poorly.
