Search code examples
regexp-substr

Why is Oracle SQL function regexp_substr not returning all matching characters?


Could anyone (with extensive experience in regular-expression matching) please clarify for me why the following query returns (what I consider) unexpected results in Oracle 12?

select regexp_substr('My email: [email protected]', '[^@:space:]+@[^@:space:]+') 
from dual;

Expected result: [email protected]

Actual result: t@t

Another example:

select regexp_substr('Beneficiary email: [email protected]', '[^@:space:]+@[^@:space:]+') 
from dual;

Expected result: [email protected]

Actual result: ry1@gm

EDIT: I double-checked and this is not related to Oracle SQL, but the same behaviour applies to any regex engine. Even when simplifying the regex to [^:space:]+@[^:space:]+ the results are the same. I am curious to know why it does not match all the non-whitespace characters before and after the @ sign. And why sometimes it matches 1 character, other times 2 or 3 or more characters, but not all.


Solution

  • The POSIX shortcut you are trying to use is incorrect, you need square brackets around it:

    SELECT REGEXP_SUBSTR('Beneficiary email: [email protected]', '[^@[:space:]]+@[^@[:space:]]+') 
    FROM dual;
    

    or even simpler, assuming you only want to validate by checking for an '@' and the email address is always at the end of the string, after the last space:

    WITH tbl(str) AS (
      SELECT 'My email: [email protected]' FROM dual UNION ALL
      SELECT 'Beneficiary email: [email protected]' FROM dual
    )
    SELECT REGEXP_REPLACE(str, '.* (.*@.*)', '\1')
    from tbl
    ;
    

    Note: REGEXP_REPLACE() will return the original string if the match is not found, where REGEXP_SUBSTR() will return NULL. Keep that in mind and handle no match found accordingly. Always expect the unexpected!