
Java Regex Ordering with \b


I am facing a weird problem with a Java regex that uses the word boundary \b. I have read through Oracle - RegexBounds and RegularExpressions - WordBoundaries.

Below is my regex (as a Java String) for an email address:

"\\b[A-Z0-9._!#$%&'*+-/=?^`{}|~]+@([-0-9a-zA-Z]+[.])+[a-zA-Z]{2,6}$"

This regex matches the email test$@example.com but not $test@example.com.

However, when I remove \b (\\b in the Java String), it matches both emails. The same happens with every special character in the regex.

What is happening with \b and the ordering in the regex? I thought that [A-Z0-9._!#$%&'*+-/=?^`{}|~]+ should match the text in any order, irrespective of \b.

Code Snippet:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ValidationUtil {

    private static final String EMAIL_ADDRESS_REGEX = "\\b[A-Z0-9._!#$%&'*+-/=?^`{}|~]+@([-0-9a-zA-Z]+[.])+[a-zA-Z]{2,6}$";
    private static final Pattern EMAIL_ADDRESS_PATTERN = Pattern.compile(EMAIL_ADDRESS_REGEX, Pattern.CASE_INSENSITIVE);

    public static boolean isValidEmail(String email) {
        if (email == null) {
            return false;
        }
        Matcher matcher = EMAIL_ADDRESS_PATTERN.matcher(email);
        return matcher.matches();
    }
}

After this issue, I moved the validation to the Apache Commons EmailValidator, but I am still curious about the reason for this behavior.

I went through many Stack Overflow topics on issues with \b, but couldn't find a related one.


Solution

  • To quote the page you link to:

    There are three different positions that qualify as word boundaries:

    • Before the first character in the string, if the first character is a word character.
    • ...

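    The first character of $test@example.com, $, is not a word character (word characters are letters, digits and the underscore), so \b does not match at the start of the string. Because matcher.matches() has to match from the very first character, the entire regex fails. In test$@example.com the first character, t, is a word character, so \b matches there and the rest of the pattern can proceed.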
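
The behavior described above can be reproduced with a small sketch. It reuses the regex from the question; the class name WordBoundaryDemo and its helper methods are hypothetical names introduced only for this illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original ValidationUtil.
public class WordBoundaryDemo {

    // The regex from the question, without the leading \b.
    private static final String BODY =
            "[A-Z0-9._!#$%&'*+-/=?^`{}|~]+@([-0-9a-zA-Z]+[.])+[a-zA-Z]{2,6}$";

    private static final Pattern WITH_BOUNDARY =
            Pattern.compile("\\b" + BODY, Pattern.CASE_INSENSITIVE);
    private static final Pattern WITHOUT_BOUNDARY =
            Pattern.compile(BODY, Pattern.CASE_INSENSITIVE);
    private static final Pattern BOUNDARY_ONLY = Pattern.compile("\\b");

    // True when a word boundary exists at index 0 of the input.
    static boolean boundaryAtStart(String s) {
        return BOUNDARY_ONLY.matcher(s).lookingAt();
    }

    // Whole-input match with the leading \b, as in the question.
    static boolean matchesWithBoundary(String s) {
        return WITH_BOUNDARY.matcher(s).matches();
    }

    // Whole-input match with the leading \b removed.
    static boolean matchesWithoutBoundary(String s) {
        return WITHOUT_BOUNDARY.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String email : new String[] {"test$@example.com", "$test@example.com"}) {
            System.out.println(email
                    + " boundaryAtStart=" + boundaryAtStart(email)
                    + " withBoundary=" + matchesWithBoundary(email)
                    + " withoutBoundary=" + matchesWithoutBoundary(email));
        }
        // test$@example.com boundaryAtStart=true withBoundary=true withoutBoundary=true
        // $test@example.com boundaryAtStart=false withBoundary=false withoutBoundary=true
    }
}
```

Since matches() already requires the pattern to cover the whole input, the leading \b adds no anchoring here; it only rejects addresses that start with a non-word character.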