regex string

Regex to match string containing two names in any order


I need a logical AND in a regex.

Something like

jack AND james

should match both of the following strings:

  • 'hi jack here is james'

  • 'hi james here is jack'


Solution

  • You can do checks using positive lookaheads. Here is a summary from the indispensable regular-expressions.info:

    Lookahead and lookbehind, collectively called “lookaround”, are zero-length assertions...lookaround actually matches characters, but then gives up the match, returning only the result: match or no match. That is why they are called “assertions”. They do not consume characters in the string, but only assert whether a match is possible or not.

    It then goes on to explain that positive lookaheads are used to assert that what follows matches a certain expression without taking up characters in that matching expression.

    So here is an expression using two consecutive positive lookaheads to assert that the phrase contains both jack and james, in either order:

    ^(?=.*\bjack\b)(?=.*\bjames\b).*$
    

    Test it.

    The expressions in parentheses starting with ?= are the positive lookaheads. I'll break down the pattern:

    1. ^ asserts the start of the string to be matched.
    2. (?=.*\bjack\b) is the first positive lookahead, saying that what follows must match .*\bjack\b.
    3. .* means any character (except a line break, by default) zero or more times.
    4. \b means a word boundary: the position between a word character and a non-word character such as white space or punctuation, or between a word character and the start or end of the string.
    5. jack is literally those four characters in a row (the same for james in the next positive lookahead).
    6. $ asserts the end of the string to be matched.
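
    To see what the \b in the breakdown above contributes, here is a minimal Python sketch using the standard re module. The variant without \b and the sample strings are illustrative assumptions, not part of the original answer.

```python
import re

# Pattern from the answer, with word boundaries around each name.
with_boundaries = re.compile(r"^(?=.*\bjack\b)(?=.*\bjames\b).*$")
# Hypothetical variant without \b, shown only for comparison.
without_boundaries = re.compile(r"^(?=.*jack)(?=.*james).*$")

text = "hi jackson here is james"

# "jackson" contains "jack", but no word boundary follows the "k".
bounded_match = with_boundaries.search(text) is not None
unbounded_match = without_boundaries.search(text) is not None

# Punctuation counts as a non-word character, so it also forms a boundary.
punctuated_match = with_boundaries.search("jack,james!") is not None

print(bounded_match, unbounded_match, punctuated_match)
```

    With \b the pattern rejects "jackson" as a stand-in for jack, while the variant without \b accepts it.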

    So the first lookahead says "what follows (and is not itself a lookahead or lookbehind) must be a string that starts with zero or more of any characters, followed by a word boundary, then jack, then another word boundary," and the second lookahead says the same thing with james in place of jack. Because a lookahead consumes no characters, both lookaheads are evaluated from the same position, the start of the string. After the two lookaheads comes .*, which simply matches any characters zero or more times, and $, which matches the end of the string.

    "start with anything then jack or james then end with anything" satisfies the first lookahead because there are a number of characters then the word jack, and it satisfies the second lookahead because there are a number of characters (which just so happen to include jack, but that is not necessary to satisfy the second lookahead) then the word james. Neither lookahead asserts the end of the string, so the .* that follows can go beyond what satisfies the lookaheads, such as "then end with anything".

    I think you get the idea, but just to be absolutely clear, here it is with jack and james reversed, i.e. "start with anything then james or jack then end with anything": it satisfies the first lookahead because there are a number of characters (which just so happen to include james, but that is not necessary to satisfy the first lookahead) then the word jack, and it satisfies the second lookahead because there are a number of characters then the word james. As before, neither lookahead asserts the end of the string, so the .* that follows can go beyond what satisfies the lookaheads, such as "then end with anything".
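
    The walkthrough above can be checked with a short Python sketch. It assumes single-line input and uses the two strings from the question plus one made-up string that contains only one of the names.

```python
import re

pattern = re.compile(r"^(?=.*\bjack\b)(?=.*\bjames\b).*$")

samples = [
    "hi jack here is james",        # from the question
    "hi james here is jack",        # from the question
    "hi jack here is nobody else",  # illustrative: james is missing
]

# Map each sample to whether the pattern matches it.
results = {s: pattern.search(s) is not None for s in samples}

# Lookaheads consume nothing, so the trailing .* matches the whole line.
whole = pattern.search("hi james here is jack").group(0)

for sample, matched in results.items():
    print(f"{sample!r}: {matched}")
```

    Both orderings match, and the string missing james does not.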

    This approach has the advantage that you can easily specify multiple conditions.

    ^(?=.*\bjack\b)(?=.*\bjames\b)(?=.*\bjason\b)(?=.*\bjules\b).*$
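
    If the list of words is not fixed, the same pattern can be assembled programmatically. The following is a sketch in Python; the helper name all_words_pattern is hypothetical, and it assumes each word begins and ends with a word character so that \b behaves as described.

```python
import re

def all_words_pattern(words):
    # One positive lookahead per word; re.escape guards against
    # regex metacharacters appearing inside a word.
    lookaheads = "".join(rf"(?=.*\b{re.escape(w)}\b)" for w in words)
    return re.compile(rf"^{lookaheads}.*$")

pattern = all_words_pattern(["jack", "james", "jason", "jules"])

# All four names present, in an arbitrary order.
all_present = pattern.search("jules, jason, james and jack") is not None
# jules is missing, so one lookahead fails and the whole match fails.
one_missing = pattern.search("jack james jason") is not None

print(pattern.pattern)
print(all_present, one_missing)
```

    Each added word costs one more scan of the string from the start, which is usually negligible for short inputs.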