
How to split look-ahead regex into 2 plain regexes?


I have a look-ahead regex [^a-z0-9%*][a-z0-9%]{3,}(?=[^a-z0-9%*]). In my test it extracts 4 substrings from @@||imasdk.googleapis.com/js/core/bridge*.html:

  • |imasdk
  • .googleapis
  • .com
  • /core

I need to rewrite it with 2 good-old regexes, as I can't use look-aheads (they are not supported by the regex engine). I've split it into [^a-z0-9%*][a-z0-9%]{3,} and [^a-z0-9%*], and the latter is checked for each match of the first regex, in the substring after the match.

For some reason it extracts /bridge too, because . is not listed in [^a-z0-9%*] and is found somewhere after /bridge. So how does the look-ahead work: does it have to be a full match, a substring match (find result), or something else? Does it mean that the character right after the match is expected to be outside the set a-z0-9%* in this case?

In Rust the code looks as follows:

    lazy_static! {
        // WARNING: the original regex is `"[^a-z0-9%*][a-z0-9%]{3,}(?=[^a-z0-9%*])"` but Rust's regex
        // does not support look-around, so we have to check it programmatically for the last match
        static ref REGEX: Regex = Regex::new(r###"[^a-z0-9%*][a-z0-9%]{3,}"###).unwrap();
        static ref LOOKAHEAD_REGEX: Regex = Regex::new(r###"[^a-z0-9%*]"###).unwrap();
    }

    let pattern_lowercase = pattern.to_lowercase();
    
    let results = REGEX.find_iter(&pattern_lowercase);
    for (is_last, each_candidate) in results.identify_last() {
        let mut candidate = each_candidate.as_str();
        if !is_last {
            // have to simulate positive-ahead check programmatically
            let ending = &pattern_lowercase[each_candidate.end()..]; // substr after the match
            println!("searching in {:?}", ending);
            let lookahead_match = LOOKAHEAD_REGEX.find(ending);
            if lookahead_match.is_none() {
                // did not find anything => look-ahead is NOT positive
                println!("NO look-ahead match!");
                break;
            } else {
                println!("found look-ahead match: {:?}", lookahead_match.unwrap().as_str());
            }
        }
         ...

test output:

"|imasdk":
searching in ".googleapis.com/js/core/bridge*.html"
found look-ahead match: "."
".googleapis":
searching in ".com/js/core/bridge*.html"
found look-ahead match: "."
".com":
searching in "/js/core/bridge*.html"
found look-ahead match: "/"
"/core":
searching in "/bridge*.html"
found look-ahead match: "/"
"/bridge":
searching in "*.html"
found look-ahead match: "."

^ Here you can see that /bridge is found because of the . that follows later in the string, which is incorrect.


Solution

  • Your LOOKAHEAD_REGEX searches for a character outside the set at any position after the match, but the original regex with look-ahead only examines the single character immediately after the match. This is why your code finds /bridge and regex101 doesn't: your code sees the . somewhere after the match, whereas regex101 only looks at the *.

    You can fix your code by anchoring LOOKAHEAD_REGEX so that it will only look at the first character: ^[^a-z0-9%*].

    Alternatively, as suggested by @Sven Marnach, you can use a single regex matching the full expression: [^a-z0-9%*][a-z0-9%]{3,}[^a-z0-9%*], and strip the last character of the match.