java regex regexbuddy

Java regex to exclude specific strings from a larger one


I have been banging my head against this for some time now: I want to capture all [a-z]+[0-9]? character sequences, excluding strings such as sin, cos, tan, etc. Having done my regex homework, I believe the following regex should work:

(?:(?!(sin|cos|tan)))\b[a-z]+[0-9]?

As you can see, I am using a negative lookahead along with alternation. The \b after the closing parenthesis of the non-capturing group is critical to avoid matching the "in" of "sin", etc. The regex makes sense to me, and in fact I have tried it in RegexBuddy with Java as the target implementation and got the wanted result, but it doesn't work with Java's Matcher and Pattern objects! Any thoughts?

cheers


Solution

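  • The \b is better placed at the front. A lookahead inspects the text that follows the current position, not what precedes it, and both \b and (?!...) are zero-width, so a match can only start at a word boundary where sin, cos or tan does not come next. Writing \b first makes that intent explicit.

    Also, the negative lookahead as written rejects any word that merely starts with a keyword, such as cost, which I'm not sure you want if you're just filtering out keywords. Adding \b after each alternative restricts the exclusion to the whole words sin, cos and tan, and a trailing \b keeps the match from stopping in the middle of a token, so abc12 is skipped rather than matched as abc1.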

    I suggest:

    \b(?!sin\b|cos\b|tan\b)[a-z]+[0-9]?\b
    

    Or, more simply, you could just match \b[a-z]+[0-9]?\b and filter out the strings in the keyword list afterwards. You don't always have to do everything in regex.
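
Both suggestions above can be tried side by side in Java. This is only a minimal sketch: the class name KeywordFilterDemo, its method names, and the sample input are made up for illustration, and the keyword set is just the three names from the question. One detail that is easy to miss: in a Java string literal every regex backslash has to be doubled, because "\b" on its own is a backspace character rather than a word boundary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeywordFilterDemo {
    // Suggested pattern; backslashes are doubled for the Java string literal.
    private static final Pattern WITH_LOOKAHEAD =
            Pattern.compile("\\b(?!sin\\b|cos\\b|tan\\b)[a-z]+[0-9]?\\b");

    // Simpler pattern for the "match first, filter afterwards" approach.
    private static final Pattern ANY_IDENTIFIER =
            Pattern.compile("\\b[a-z]+[0-9]?\\b");

    // Hypothetical keyword list, taken from the examples in the question.
    private static final Set<String> KEYWORDS = Set.of("sin", "cos", "tan");

    // Approach 1: let the regex reject the keywords.
    static List<String> withLookahead(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = WITH_LOOKAHEAD.matcher(input);
        // find() scans for successive matches inside the input;
        // matches() would instead require the whole input to match.
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    // Approach 2: match every candidate, then drop the keywords in code.
    static List<String> matchThenFilter(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = ANY_IDENTIFIER.matcher(input);
        while (m.find()) {
            if (!KEYWORDS.contains(m.group())) {
                result.add(m.group());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "sin x1 cost tan y cos2 abc12";
        System.out.println(withLookahead(input));   // [x1, cost, y, cos2]
        System.out.println(matchThenFilter(input)); // [x1, cost, y, cos2]
    }
}
```

On this input both approaches keep cost and cos2, drop the bare keywords, and skip abc12 because of the trailing \b. The second version is arguably easier to maintain if the keyword list grows, since adding a name does not touch the regex.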