Search code examples
regexpostgresqlposix

Why do I get empty response for regexp_matches function while using positive lookahead (?=...)


Why the following code returns just empty brackets - {''}. How to make it return matching strings?

SELECT regexp_matches('ATGCATGCATGCCAACAACAACCTGTCAAGTGAGT','(?=..CAA)','g');

Expected output is:

regexp_matches 
----------------
{GCCAA}
{AACAA}
{AACAA}
{GTCAA}
(4 rows)

but instead it returns the following:

 regexp_matches 
----------------
 {""}
 {""}
 {""}
 {""}
(4 rows)

I actually have a bit more complicated query, which requires positive lookahead in order to cover all occurrences of patterns in the string even if they overlap.


Solution

  • Well, it's not pretty, but you can do it without regular expressions or custom functions.

    WITH data(d) as (
      SELECT * FROM (VALUES ('ATGCATGCATGCCAACAACAACCTGTCAAGTGAGT')) v
    )
    SELECT substr(d, x, 5) AS match
    FROM data
    JOIN LATERAL (SELECT generate_series(1, length(d))) g(x) ON TRUE
    WHERE substr(d, x, 5) LIKE '__CAA'
    ;
     match
    -------
     GCCAA
     AACAA
     AACAA
     GTCAA
    (4 rows)
    

    Basically, get each five letter slice of the string and see if it matches __CAA.

    You could change generate_series(1, length(d)) to generate_series(1, length(d)-4) because the last ones will never match, but you would have to remember to update this if the length of your matching string changes.