javascript, regex

A regex to test whether an input contains two specified strings with no element of a specified set of strings occurring between them


Given two strings (denoted by A and B) and a set N of strings, I need to write a regular expression to test whether a given input string W contains a substring S that satisfies all of the following three conditions:

1. S starts with A;
2. S ends with B;
3. no element of N occurs in the part between A and B (this part does not overlap with A and B).

For example, I chose "ab" as A, "bc" as B, and ["a", "cb", "cd"] as N. With "ec" as the inner part, "abecbc" is a string that satisfies all three conditions, so if W contains such a substring, the regex must return true. My first variant is the following regex:

var T = /(?=ab.*bc)(?=(?!ab.*a.*bc))(?=(?!ab.*cb.*bc))(?=(?!ab.*cd.*bc))/;  

I chose W = S = "abecbc". This regex works as expected:

T.test("abecbc");
// true

But I am interested in the following problem: how can I write a functionally equivalent regex without using the positive lookahead (?=) as the AND operator?

So my second variant is the following:

var R = /ab(?!.*?(?:a|cb|cd).*)bc/;

But R.test("abecbc") evaluates to false. So let us split R into three parts:

/ab(.*)/.test("abecbc")

returns true. Then

/(.*)bc/.test("abecbc")

returns true.

The inner part (i.e. the part between "ab" and "bc") is "ec". And

/(?!.*?(?:a|cb|cd).*)/.test("ec")

returns true, which is expected. So all three parts succeed on their own, and R has no other parts. Then why does

/ab(?!.*?(?:a|cb|cd).*)bc/.test("abecbc")

evaluate to false? And how can I write a correct regex that solves the problem described in the first paragraph of this post without using the positive lookahead (?=) as the AND operator?
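
For convenience, the four checks above can be collected into one runnable snippet. This is only a sketch that restates the experiment; the variable names (W, inner, startsWithA, endsWithB, innerIsClean, whole) are made up for illustration.

```javascript
// W and the inner part come from the example above; all names are illustrative.
const W = "abecbc";
const inner = "ec";

// Second variant from the question.
const R = /ab(?!.*?(?:a|cb|cd).*)bc/;

// The three parts, each tested on its own.
const startsWithA = /ab(.*)/.test(W);
const endsWithB = /(.*)bc/.test(W);
const innerIsClean = /(?!.*?(?:a|cb|cd).*)/.test(inner);

// The full pattern applied to W.
const whole = R.test(W);

console.log({ startsWithA, endsWithB, innerIsClean, whole });
// { startsWithA: true, endsWithB: true, innerIsClean: true, whole: false }
```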

EDIT

My question is not a duplicate of this question: I need an explanation of why the particular regex (R) returns false instead of true. Another difference is that I do not need to test whether the inner part contains a specified string.


Solution

  • Your attempted regex R = /ab(?!.*?(?:a|cb|cd).*)bc/ fails to match abecbc because a negative lookahead is a zero-width assertion: it consumes no characters, so with your regex bc has to immediately follow ab. And if you try to fix it by adding .* before bc, the lookahead still scans the entire rest of the input, so there's no guarantee that a rejected occurrence of a|cb|cd actually lies between ab and bc.

    You can instead capture B and what comes after it so that you can use the capture as an ending in a negative lookahead assertion to avoid a match when there's any of N between A and B:

    ab(?=.*?(bc.*))(?!.*(?:a|cb|cd).*\1).*?bc
    

    Demo: https://regex101.com/r/NqLbfV/4

    EDIT: The solution above performs a non-greedy match, but since you later indicated in the comments that you desire a greedy match, you can instead capture what comes before A and use a negative lookbehind assertion to avoid an occurrence of any of N between A and B:

    (?<=(.*))ab.*(?<!\1ab.*(?:a|cb|cd).*)bc
    

    Demo: https://regex101.com/r/7xuUNP/2

    Note that this requires that your browser supports variable-width lookbehind patterns, which is currently the case for all major modern browsers.
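
As a rough check of the explanation and of both patterns above, the following sketch runs them in JavaScript. The names naiveFix, lazyFix and greedyFix, and every input other than "abecbc", are made up for illustration; naiveFix is one reading of the "add .* before bc" attempt described above, not a pattern given in the question.

```javascript
// R is the regex from the question; lazyFix and greedyFix are the two
// patterns from the answer. naiveFix is an assumed rendering of the
// "add .* before bc" attempt. All variable names are illustrative.
const R = /ab(?!.*?(?:a|cb|cd).*)bc/;
const naiveFix = /ab(?!.*?(?:a|cb|cd).*).*bc/;
const lazyFix = /ab(?=.*?(bc.*))(?!.*(?:a|cb|cd).*\1).*?bc/;
const greedyFix = /(?<=(.*))ab.*(?<!\1ab.*(?:a|cb|cd).*)bc/;

// The lookahead consumes nothing, so R only matches when bc directly follows ab.
console.log(R.test("abbc"));   // true
console.log(R.test("abecbc")); // false

// With .* added, the lookahead still inspects everything after ab,
// so an "a" that comes after bc rejects an otherwise valid input.
console.log(naiveFix.test("abebc"));   // true
console.log(naiveFix.test("abebcxa")); // false

// Both patterns from the answer accept the first two inputs and reject
// the last two, whose inner part contains an element of N.
for (const re of [lazyFix, greedyFix]) {
  console.log(
    re.test("abecbc"), re.test("abebcxa"),
    re.test("abcdbc"), re.test("abeabc")
  ); // true true false false
}

// The two patterns differ in which occurrence of B ends the match.
console.log("abxbcybc".match(lazyFix)[0]);   // abxbc
console.log("abxbcybc".match(greedyFix)[0]); // abxbcybc
```

The last two lines show the non-greedy versus greedy behavior mentioned in the EDIT: the first pattern stops at the first bc, while the second extends to the last bc that still leaves the inner part free of a, cb and cd.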