Could anyone (with extensive experience in regular-expression matching) please clarify for me why the following query returns (what I consider) unexpected results in Oracle 12?
select regexp_substr('My email: [email protected]', '[^@:space:]+@[^@:space:]+')
from dual;
Expected result: [email protected]
Actual result: t@t
Another example:
select regexp_substr('Beneficiary email: [email protected]', '[^@:space:]+@[^@:space:]+')
from dual;
Expected result: [email protected]
Actual result: ry1@gm
EDIT: I double-checked and this is not related to Oracle SQL, but the same behaviour applies to any regex engine. Even when simplifying the regex to [^:space:]+@[^:space:]+ the results are the same. I am curious to know why it does not match all the non-whitespace characters before and after the @ sign. And why sometimes it matches 1 character, other times 2 or 3 or more characters, but not all.
The POSIX shortcut you are trying to use is incorrect, you need square brackets around it:
SELECT REGEXP_SUBSTR('Beneficiary email: [email protected]', '[^@[:space:]]+@[^@[:space:]]+')
FROM dual;
or even simpler, assuming you only want to validate by checking for an '@' and the email address is always at the end of the string, after the last space:
WITH tbl(str) AS (
SELECT 'My email: [email protected]' FROM dual UNION ALL
SELECT 'Beneficiary email: [email protected]' FROM dual
)
SELECT REGEXP_REPLACE(str, '.* (.*@.*)', '\1')
from tbl
;
Note: REGEXP_REPLACE()
will return the original string if the match is not found, where REGEXP_SUBSTR()
will return NULL. Keep that in mind and handle no match found accordingly. Always expect the unexpected!