Tags: sql, postgresql, algorithm, aggregation

Find the longest prefix for a table of tokenized strings


I have a table (n < 100) with '#'-separated values in Postgres 13+. Sample:

id  value
1   'ABC#DEF#123#BAIT'
2   'ABC#DEF#156#BAIT'
3   'ABC#DEF#178#BAIT'
...

These values can be of any length (e.g. 'ABC#DEFDEF#GHIG#123123#456#678').
There are no NULL values.

I need the longest common prefix across all rows: entire tokens only (no partial match within a token), matching from the start of the string.

Result for the above sample would be:

id   output
1    'ABC#DEF'  -- or 'ABC#DEF#'

Another example:

id  value
1   'ABC#123'
2   'ADE#456'

Should return:

id   output
1    ''

Since the tokens 'ABC' and 'ADE' do not match completely, the query should not return 'A'.

Currently I'm using:

CREATE OR REPLACE FUNCTION lcp_iterate(_state TEXT, value TEXT)
RETURNS TEXT
AS
$$
    SELECT  SUBSTRING($2, 1, s - 1)  -- cut just before the first differing character
    FROM    generate_series(1, LEAST(LENGTH($1), LENGTH($2))) s
    WHERE   SUBSTRING($1, 1, s) <> SUBSTRING($2, 1, s)
    UNION ALL
    SELECT  LEAST($1, $2)  -- no mismatch: keep the shorter string (also covers the NULL initial state)
    LIMIT 1;
$$
LANGUAGE sql;

DO $$ BEGIN
    CREATE AGGREGATE lcp(TEXT) (SFUNC = lcp_iterate, STYPE = TEXT);
EXCEPTION
    WHEN SQLSTATE '42723' THEN NULL;  -- duplicate_function: aggregate already exists
END $$;

But for the second example it returns:

row  output
1    'A'

It would be extremely simple in any programming language, but I'm struggling to find a way to do it in Postgres.

PS: I'm trying to keep my code base as readable as possible, so I'd like to avoid nesting 20 CASE expressions (values can have at most 20 '#' separators).
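For comparison, here is a sketch of a token-aware variant of the same aggregate idea, comparing arrays of tokens instead of characters. The names lcp_token_iterate and lcp_tokens are made up for illustration:

-- sketch: keep the longest leading run of tokens shared by state and value
CREATE OR REPLACE FUNCTION lcp_token_iterate(_state TEXT, _value TEXT)
RETURNS TEXT
AS
$$
    SELECT array_to_string(a[1:k], '#')
    FROM   string_to_array($1, '#') AS t1(a)
         , string_to_array($2, '#') AS t2(b)
         , LATERAL (
        SELECT COALESCE(max(s), 0) AS k  -- longest matching token prefix
        FROM   generate_series(1, LEAST(cardinality(a), cardinality(b))) s
        WHERE  a[1:s] = b[1:s]
        ) sub;
$$
LANGUAGE sql STRICT;  -- STRICT: the first row's value becomes the initial state

CREATE AGGREGATE lcp_tokens(TEXT) (SFUNC = lcp_token_iterate, STYPE = TEXT);

For the first sample, SELECT lcp_tokens(value) FROM tbl; would return 'ABC#DEF'.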

Basic query to try:

WITH source AS (
  SELECT 'ABC#DEF#123' AS value
  UNION ALL 
  SELECT 'ABC#DEF#456'
)

SELECT lcp(value)
FROM source;

Solution

  • Assuming id is the PRIMARY KEY and value is defined NOT NULL; otherwise you need to decide how to deal with NULL values (and do more).

    RECURSIVE CTE

    WITH RECURSIVE cte AS (
       (
       SELECT id, string_to_array(value, '#') AS arr, 0 AS idx
       FROM   tbl
       ORDER  BY id
       LIMIT  1
       )
    
       UNION ALL
       SELECT c.id, c.arr, c.idx + 1
       FROM   cte c
       WHERE  c.idx < cardinality(c.arr)  -- end of string
       AND    NOT EXISTS (                -- mismatch
          SELECT FROM tbl t
          WHERE  t.id > c.id
          AND    t.value !~ ('^' || array_to_string(c.arr[1:idx+1], '#') || '(#|$)')  -- trailing '#' or end
          )
       )
    SELECT array_to_string(arr[1:idx], '#') AS result
    FROM   cte
    ORDER  BY idx DESC
    LIMIT  1;
    

    fiddle

    Pick the first row.
    Then, for each successive token prefix of its value, check whether any other row disagrees. Keep looping until the first mismatch or the end of the string.
    The last iteration yields the longest common prefix. (For a single row it's the whole value.)
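    To try it, here is a throwaway setup with the sample data from the question (the table name tbl matches the query above):

    CREATE TEMP TABLE tbl (id int PRIMARY KEY, value text NOT NULL);
    INSERT INTO tbl VALUES
      (1, 'ABC#DEF#123#BAIT')
    , (2, 'ABC#DEF#156#BAIT')
    , (3, 'ABC#DEF#178#BAIT');

    -- the rCTE advances idx = 0, 1, 2 and stops at the first mismatching token;
    -- the final SELECT then returns 'ABC#DEF'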

    I formulated the predicate with a left-anchored, negated regex match (!~) to employ an index (!!!), which is essential to make this scale. There are various candidates: a trigram GIN index, or a plain B-tree index with COLLATE "C".

    Or a more specialized (properly customized!) text-search index.

    In recent versions an SP-GiST index should work very efficiently too.
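    For illustration, the candidate indexes could be created like so (index names are made up; assuming the table from the question, pick the one that fits):

    -- trigram GIN index (requires the pg_trgm extension)
    CREATE EXTENSION IF NOT EXISTS pg_trgm;
    CREATE INDEX tbl_value_trgm_idx ON tbl USING gin (value gin_trgm_ops);

    -- or: plain B-tree index with "C" collation
    CREATE INDEX tbl_value_c_idx ON tbl (value COLLATE "C");

    -- or: SP-GiST index on text
    CREATE INDEX tbl_value_spgist_idx ON tbl USING spgist (value);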

    The beauty:

    1. Can use an index, so the size of the table matters little. I.e., it scales excellently.
    2. Stops as soon as the first mismatch is found and does not look at the rest, however long that may be.
    3. Works for all corner cases.

    The beast:

    1. Added complexity from the rCTE.
    2. The overhead won't pay off for small tables with short strings. Test to find your sweet spot.

    string_to_table() and count

    SELECT string_agg(token, '#' ORDER BY idx) AS result
    FROM  (
       SELECT token, idx
            , row_number() OVER (ORDER BY idx) AS rn
       FROM   tbl, string_to_table(value, '#') WITH ORDINALITY AS a(token, idx)
       GROUP  BY token, idx
       HAVING count(token) = (SELECT count(*) FROM tbl)  -- token shared by every row
       ORDER  BY idx
       ) sub
    WHERE  idx = rn;  -- exclude positions past the first gap
    

    fiddle

    This uses the same basic approach as JNevill's query. Fixes a corner case error and should be faster. Note that string_to_table() requires Postgres 14 or later; in Postgres 13, unnest(string_to_array(value, '#')) WITH ORDINALITY works the same way.

    Unnest tokens into separate rows, remembering their position (idx).

    Aggregate tokens per position and only keep those that reach the total row count. At the same time, take note of the row_number() of each qualifying position (rn).
    Concatenate tokens after excluding positions past the first gap (idx > rn; see the demo below and in the fiddle) to arrive at the longest prefix.
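    To see why that filter is needed, here is a minimal demo of the gap case, with made-up sample values where position 3 agrees in all rows but position 2 does not:

    WITH tbl (id, value) AS (
       VALUES (1, 'A#X#C')
            , (2, 'A#Y#C')
       )
    SELECT token, idx
         , row_number() OVER (ORDER BY idx) AS rn
    FROM   tbl, string_to_table(value, '#') WITH ORDINALITY AS a(token, idx)
    GROUP  BY token, idx
    HAVING count(token) = (SELECT count(*) FROM tbl);

    -- returns ('A', 1, 1) and ('C', 3, 2):
    -- 'C' has idx = 3 but rn = 2, so the filter idx = rn drops it
    -- and the longest common prefix is just 'A'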

    Should be faster than the first query for small tables and short strings. But it scales with the number of rows times the number of token positions per string, i.e. poorly.
