Regex to check if string contains anything other than allowed words


I would like to check if a string contains any word other than some predefined ones. The predefined words are "What is", "plus", "minus", "multiplied by" and "divided by"; note that some of the phrases contain a single space. I've read this post and this one, both using negative lookaheads, but couldn't come up with a pattern that worked.

For example, input text "What is plus abc divided by" should come back as "abc" not recognized.

What would be a correct regex for this?

Edit:

Note that I don't care about what the invalid token is, just that it exists. It can be anything, a word or a number. The question can also be thought of as "check if the input contains only allowed words".


Solution

  • Simply join them up in a group:

    (?:What is|plus|minus|multiplied by|divided by)
    

    Note that if one token starts with another, for example multiply and multiply by, the longer one (multiply by) must come first, because alternatives are tried from left to right:

    (?:What is|plus|minus|multiply by|multiply)
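
To see why the order matters, here is a minimal Python sketch using the standard `re` module. The variable names are made up for illustration; only the two alternations come from the answer.

```python
import re

# Alternatives are tried left to right; the first one that matches wins.
shorter_first = re.compile(r"(?:multiply|multiply by)")
longer_first = re.compile(r"(?:multiply by|multiply)")

text = "multiply by"
short_match = shorter_first.match(text).group()  # stops after "multiply"
long_match = longer_first.match(text).group()    # consumes the whole token

print(repr(short_match), repr(long_match))
```

With the shorter alternative first, the trailing " by" is left unconsumed, which matters most when the pattern is used to tokenize the input one match at a time.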
    

    To check if the string only contains valid tokens, use:

    ^                  # Match at the start of string
    \g<token>          # a pre-defined token
    (?:\s+\g<token>)*  # followed by 0 or more tokens
    $                  # right before the end of string.
    

    ...where \g<token> denotes the expression above.

    Try it on regex101.com.
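
As a rough illustration, the same check can be sketched in Python. Python's `re` module has no `\g<token>` subroutine call inside a pattern, so this sketch splices the token expression in with string formatting instead. The names `TOKEN`, `ONLY_VALID` and `only_allowed` are hypothetical, and `fullmatch` stands in for the `^` and `$` anchors.

```python
import re

# The token alternation from the answer, reused as a plain string.
TOKEN = r"(?:What is|plus|minus|multiplied by|divided by)"

# One token, then zero or more whitespace-separated tokens.
ONLY_VALID = re.compile(rf"{TOKEN}(?:\s+{TOKEN})*")


def only_allowed(text: str) -> bool:
    """Return True if text consists solely of allowed tokens."""
    # fullmatch requires the pattern to cover the entire string.
    return ONLY_VALID.fullmatch(text) is not None


print(only_allowed("What is plus divided by"))
print(only_allowed("What is plus abc divided by"))
```

Using `fullmatch` rather than `$` is a deliberate choice here: in Python, `$` also matches just before a trailing newline, so an input ending in "\n" would otherwise pass.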

    Original answer

    Since we also need to find the (first) invalid token, match every run of non-whitespace characters and capture in a group those that are not matched by the expression above:

    (?:What is|plus|minus|multiplied by|divided by)|(\S+)
    

    If a match has group 1 set, it is an unrecognized token. Output an error accordingly.

    Try it on regex101.com.
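
A minimal Python sketch of this tokenizing approach, assuming the standard `re` module; `TOKENIZER` and `first_invalid_token` are hypothetical names chosen for the example.

```python
import re

# Allowed tokens match without capturing; anything else lands in group 1.
TOKENIZER = re.compile(r"(?:What is|plus|minus|multiplied by|divided by)|(\S+)")


def first_invalid_token(text):
    """Return the first unrecognized token, or None if there is none."""
    for match in TOKENIZER.finditer(text):
        if match.group(1) is not None:
            return match.group(1)
    return None


print(first_invalid_token("What is plus abc divided by"))
print(first_invalid_token("What is plus"))
```

Note that the pattern has no word boundaries, so an input such as "plush" is read as the token "plus" followed by the invalid token "h". Whether that is acceptable depends on the inputs you expect.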