regex pcre

Match a string between two or more words regardless of order


I need a regular expression that matches a set of words regardless of their order. As an example, these lines should match, with the match covering the marked range:

A longword1 B longword2 C
  ^-------------------^

A longword2 B longword1 C
  ^-------------------^

while these shouldn't:

A longword1 B longword1 C
A longword2 B longword2 C
A longword1 B
A longword2 C

(A, B, C are fillers, they can be essentially any text)

It is possible to just use alternation, such as: \b((longword1).*?(longword2)|(longword2).*?(longword1))\b. But the regex would grow factorially, i.e. three words would need 3! = 6 alternatives. It's also possible to use subroutines, e.g. \b((?'A'longword1).*?(?'B'longword2)|(?P>B).*?(?P>A))\b. Although shorter, it would still need to include every permutation.
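To make the factorial growth concrete, here is a small Python sketch that builds the permutation-based alternation for any word list. It is not part of the original question: the helper name `permutation_pattern` is hypothetical, the generated pattern drops the capture groups, and Python's `re` module stands in for PCRE, which should be close enough here because only alternation, lazy quantifiers and word boundaries are involved.

```python
import itertools
import re


def permutation_pattern(words):
    """Build one alternative per ordering of `words` (n! alternatives).

    Hypothetical helper for illustration only.
    """
    alternatives = []
    for order in itertools.permutations(words):
        # Join the words of this ordering with a lazy gap between them.
        alternatives.append(r"\b.*?\b".join(re.escape(w) for w in order))
    return r"\b(?:" + "|".join(alternatives) + r")\b"


two = re.compile(permutation_pattern(["longword1", "longword2"]))

m = two.search("A longword2 B longword1 C")
print(m.group(0) if m else None)   # longword2 B longword1

m = two.search("A longword1 B longword1 C")
print(m.group(0) if m else None)   # None

# The number of alternatives is n!, so the pattern grows very quickly.
for n in range(2, 6):
    words = [f"longword{i}" for i in range(1, n + 1)]
    print(n, permutation_pattern(words).count("|") + 1)
# 2 2
# 3 6
# 4 24
# 5 120
```

The pattern can be generated, but with five words it already carries 120 alternatives, which is the problem described above.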

Now I've read this post and this other one, but the accepted answers don't exactly solve my problem. Using \b(?=.*longword1)(?=.*longword2).*\b would match the whole line instead of the range I've shown.
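A quick way to see the difference is the following Python sketch. It is an illustration only: Python's `re` is used in place of PCRE, which is assumed to behave the same way for this pattern since it relies only on lookaheads, `.*` and word boundaries.

```python
import re

line = "A longword1 B longword2 C"

# Lookahead-only pattern: both words are asserted to exist, but the only
# consuming part is `.*`, so the match runs from the start to the end of the line.
lookahead_only = re.compile(r"\b(?=.*longword1)(?=.*longword2).*\b")
matched = lookahead_only.search(line).group(0)
print(repr(matched))       # 'A longword1 B longword2 C'

# The range wanted instead: from the first long word to the last one.
wanted = "longword1 B longword2"
print(matched == wanted)   # False
```

The lookaheads only test the line; they do not narrow the span that `.*` consumes.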

I understand that it would be much easier to check the sentence against a list of words, but my current use case rules that out; I can only use regexes.

Here are some links to demonstrate what I meant:

EXPECTED:

INCORRECT:

Are there any simpler regex(es) to tackle this?


Solution

  • You may use a backreference + a subroutine:

    \b(longword1|longword2)\b.*?\b(?!\1\b)(?1)\b
    

    Expanding it for three alternatives:

    \b(longword1|longword2|longword3)\b.*?\b(?!\1\b)((?1))\b.*?\b(?!(?:\1|\2)\b)(?1)\b
    

    See the regex demo and this regex demo, too. So, the list of words lives only in Group 1; for each additional word, you only need to capture the preceding subroutine call and add its backreference to the lookahead before the next subroutine.

    Details

    • \b(longword1|longword2)\b - a whole word, either longword1 or longword2
    • .*? - any 0 or more chars other than line break chars, as few as possible
    • \b - a word boundary
    • (?!\1\b) - a negative lookahead: the text here cannot be the same as what Group 1 matched, followed by a word boundary
    • (?1) - a subroutine that matches the same pattern as in Group 1
    • \b - a word boundary
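The pattern above can be generalized to any number of words. The sketch below does this in Python under some loudly stated assumptions: Python's built-in `re` module has no subroutine calls, so the alternation is written out again wherever the PCRE pattern uses (?1); every later word is captured, including the last one, for simplicity; and the helper name `any_order_pattern` is hypothetical. It illustrates the backreference idea and is not a drop-in replacement for the PCRE pattern.

```python
import re


def any_order_pattern(words):
    """Match len(words) distinct words from `words`, in any order.

    Hypothetical helper: the alternation is inlined where the PCRE
    pattern uses (?1), because `re` has no subroutine calls.
    """
    alternation = "|".join(re.escape(w) for w in words)
    # Group 1 captures the first word found.
    parts = [r"\b(" + alternation + r")\b"]
    for k in range(1, len(words)):
        # Backreferences \1..\k: words that were already matched.
        seen = "|".join("\\" + str(i) for i in range(1, k + 1))
        # Lazy gap, then a word that is none of the already matched ones.
        parts.append(r".*?\b(?!(?:" + seen + r")\b)(" + alternation + r")\b")
    return "".join(parts)


two_words = re.compile(any_order_pattern(["longword1", "longword2"]))

for line in [
    "A longword1 B longword2 C",
    "A longword2 B longword1 C",
    "A longword1 B longword1 C",
    "A longword2 C",
]:
    m = two_words.search(line)
    print(m.group(0) if m else None)
# longword1 B longword2
# longword2 B longword1
# None
# None

three_words = re.compile(
    any_order_pattern(["longword1", "longword2", "longword3"])
)
m = three_words.search("x longword3 y longword1 z longword2 w")
print(m.group(0) if m else None)   # longword3 y longword1 z longword2
```

The generated pattern grows linearly in the number of lookahead entries per word rather than factorially, since only the backreference list gets longer with each added word.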