I have a table (n < 100) with '#'-separated values in Postgres 13+. Sample:
id value
1 'ABC#DEF#123#BAIT'
2 'ABC#DEF#156#BAIT'
3 'ABC#DEF#178#BAIT'
...
These values can be of any length (e.g. 'ABC#DEFDEF#GHIG#123123#456#678').
There are no NULL values.
I need the longest common prefix: entire tokens only (no partial match within a token), matching from the start of the string, and matching across all rows.
Result for the above sample would be:
id output
1 'ABC#DEF' -- or 'ABC#DEF#'
Another example:
id value
1 'ABC#123'
2 'ADE#456'
Should return:
id output
1 ''
Since the tokens 'ABC' and 'ADE' do not match completely, it should not return 'A'.
Currently I'm using:
CREATE OR REPLACE FUNCTION lcp_iterate(_state TEXT, value TEXT)
RETURNS TEXT
AS
$$
SELECT SUBSTRING($2, 1, s - 1)
FROM generate_series(1, LEAST(LENGTH($1), LENGTH($2))) s
WHERE SUBSTRING($1, 1, s) <> SUBSTRING($2, 1, s)
UNION ALL
SELECT LEAST($1, $2)
LIMIT 1;
$$
LANGUAGE sql;
DO $$ BEGIN
CREATE AGGREGATE lcp(TEXT) (SFUNC = lcp_iterate, STYPE = TEXT);
EXCEPTION
WHEN SQLSTATE '42723' THEN NULL;
END $$;
But it returns:
row output
1 'A'
It would be extremely simple in any programming language, but I'm struggling to find a way to do it in Postgres.
PS: I'm trying to keep my code base as readable as possible so I'd like to avoid nesting 20 CASE
items. (Since I can't have more than 20 '#'.)
Basic query to try:
WITH source AS (
SELECT 'ABC#DEF#123' AS value
UNION ALL
SELECT 'ABC#DEF#456'
)
SELECT my_func(value)
FROM source;
Assuming id is the PRIMARY KEY and value is defined NOT NULL; otherwise you need to declare how to deal with NULL values (and do more).
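For reference, a minimal setup matching those assumptions (the table name tbl and the sample data are taken from the question; this is just a sketch for testing):

```sql
CREATE TABLE tbl (
  id    int  PRIMARY KEY
, value text NOT NULL
);

INSERT INTO tbl (id, value) VALUES
  (1, 'ABC#DEF#123#BAIT')
, (2, 'ABC#DEF#156#BAIT')
, (3, 'ABC#DEF#178#BAIT');
```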
RECURSIVE CTE
WITH RECURSIVE cte AS (
(
SELECT id, string_to_array(value, '#') AS arr, 0 AS idx
FROM tbl
ORDER BY id
LIMIT 1
)
UNION ALL
SELECT c.id, c.arr, c.idx + 1
FROM cte c
WHERE c.idx < cardinality(c.arr) -- end of string
AND NOT EXISTS ( -- mismatch
SELECT FROM tbl t
WHERE t.id > c.id
AND t.value !~ ('^' || array_to_string(c.arr[1:c.idx+1], '#') || '(#|$)') -- prefix must end at '#' or end of string
)
)
SELECT array_to_string(arr[1:idx], '#') AS result
FROM cte
ORDER BY idx DESC
LIMIT 1;
Pick the first row.
Then, for every token-prefix of its value, check whether any other row disagrees. Keep looping until the first mismatch or the end of the string.
The last iteration yields the longest common prefix. (For a single row it's the whole value.)
I formulated the predicate with a left-anchored, negated regex match (!~) to employ an index (!!!), which is essential to make this scale. There are various candidates: a trigram GIN index, a plain B-tree index with COLLATE "C", or a more specialized (properly customized!) text-search index. In recent versions, an SP-GiST index should work very efficiently, too.
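For example, either of the following can support the left-anchored match (the index names are my own invention):

```sql
-- A plain B-tree index with "C" collation
-- supports left-anchored regex / LIKE patterns:
CREATE INDEX tbl_value_c_idx ON tbl (value COLLATE "C");

-- Or a trigram GIN index (requires the additional module pg_trgm):
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX tbl_value_trgm_idx ON tbl USING gin (value gin_trgm_ops);
```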
string_to_table() and count()
SELECT string_agg(token, '#') AS result
FROM (
SELECT token, idx
, row_number() OVER (ORDER BY idx) AS rn
FROM tbl, string_to_table(value, '#') WITH ORDINALITY AS a(token, idx)
GROUP BY token, idx
HAVING count(token) = (SELECT count(*) FROM tbl)
ORDER BY idx
) sub
WHERE idx = rn;
This uses the same basic approach as JNevill's query, fixes a corner-case error, and should be faster.
Unnest tokens into separate rows, remembering their position (idx).
Aggregate tokens per position and only keep those that reach the total row count. At the same time, take note of the row_number() for each qualifying position (rn).
Concatenate the tokens after excluding all positions past the first gap (where idx > rn) to arrive at the longest common prefix.
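To illustrate the gap exclusion with a hand-checked toy example (made-up values): for the two rows 'A#X#C' and 'A#Y#C', positions 1 ('A') and 3 ('C') reach the total count and get rn = 1 and rn = 2. Position 3 fails idx = rn, so only 'A' survives. Note that string_to_table() requires Postgres 14; in Postgres 13, substitute unnest(string_to_array(value, '#')) WITH ORDINALITY:

```sql
SELECT string_agg(token, '#') AS result
FROM  (
   SELECT token, idx
        , row_number() OVER (ORDER BY idx) AS rn
   FROM  (VALUES ('A#X#C'), ('A#Y#C')) t(value)
        , string_to_table(value, '#') WITH ORDINALITY AS a(token, idx)
   GROUP  BY token, idx
   HAVING count(token) = 2  -- total number of rows in this toy set
   ORDER  BY idx
   ) sub
WHERE  idx = rn;  -- returns 'A'
```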
Should be faster than the first query for small tables and short strings. But it scales with the number of rows × the number of positions per string, i.e. poorly.