Search code examples
mysqlregexrlike

MySQL RLIKE behaviour for numbered string


I am using RLIKE to find some email domains with mysql.

Here is the Query:

SELECT something 
FROM table1 
WHERE SUBSTRING_INDEX(table1.email, "@", -1) RLIKE "test1.com"|"test2.com"

This matched all the email domains with numbers in, example:

aaa@domain0.com

Any idea why?

EDIT: I also noticed that it finds email domains that has at least two successive digits.

Really strange.


Solution

  • The string supplied to RLIKE or REGEXP needs to be a quoted string, wherein the entire regular expression is single-quoted. What you have are two double-quoted strings separated by |, which is the bitwise OR operator.

    That is causing the whole expression to be evaluated as 0, and that's why the domain aaa@domain0.com is matched:

    # The unquoted | evaluates this to zero:
    mysql> SELECT "string" | "string";
    +---------------------+
    | "string" | "string" |
    +---------------------+
    |                   0 |
    +---------------------+
    
    # And zero matches domain0.com
    mysql> SELECT 'domain0.com' RLIKE '0';
    +-------------------------+
    | 'domain0.com' RLIKE '0' |
    +-------------------------+
    |                       1 |
    +-------------------------+
    

    Instead, you would need to use RLIKE with a single-quoted string, and backslash-escape the .. I'm also adding ^$ anchors so substrings are not matched.

    WHERE SUBSTRING_INDEX(table1.email, "@", -1) RLIKE '^test1\.com$|^test2\.com$'
    

    It could also be expressed as '^(test1\.com|test2\.com)$'. The trick is that | has very low precedence so you need to ensure both ends are anchored for every possible string you want to match.

    However, if you are just trying to match a list of domains, it is far easier to do it with IN () so you may merely list them:

    WHERE SUBSTRING_INDEX(table1.email, "@", -1) IN ('test1.com', 'test2.com', 'test4.org')