Matches lookbehind / ahead multiple times


Code:

public static void main(String[] args) {
    String mainTag = "HI";
    String replaceTag = "667";
    String text = "92<HI=/><z==//HIb><cHIhi> ";
    System.out.println(strFormatted(mainTag, replaceTag, text));

    mainTag = "aBc";
    replaceTag = "923";
    text = "<dont replacethis>abcabc< abcabcde >";
    System.out.println(strFormatted(mainTag, replaceTag, text));
}

private static String strFormatted(String mainTag, String replaceTag, String text) {
    return text.replaceAll("(?i)(?<=<)" + mainTag + "(?=.*>)", replaceTag);
}

So, I want to replace mainTag (variable) with replaceTag (variable), but only inside tags (<...>).

In the example above I want to replace the mainTag HI (case insensitive) in all occurrences inside <...> with 667, but my code only replaces the first occurrence.

Examples:

92<HI=/><z==//HIb><cHIhi> 

Expected output:

92<667=/><z==//667b><c667667> 

(mainTag = "HI", replaceTag = "667")

<dont replacethis>abcabc<abcabcde>

Expected output:

<dont replacethis>abcabc<923923de>

(mainTag = "aBc", replaceTag = "923")

Note: My code is wrong not only because it replaces just one occurrence, but also because it only works when mainTag immediately follows the "<". In other words, the lookbehind only covers one specific situation.


Solution

  • You only need a look-ahead here. The idea is to match every mainTag that is followed by a closing >, and then only by balanced <...> pairs up to the end of the string, and replace it with replaceTag. The following regex would work:

    text.replaceAll("(?i)" + mainTag + "(?=[^<>]*>(?:[^<>]*<[^<>]*>)*[^<>]*$)", replaceTag);
    

    Explanation:

    (?i)               # Ignore case
    mainTag            # Match mainTag
    (?=                # which is followed by
        [^<>]*         # 0 or more characters which are not < or >
        >              # A closing bracket (this ensures mainTag sits before a closing bracket, i.e. inside a tag)
        (?:            # Start a group (to match a pair of brackets)
            [^<>]*     # Non-bracket characters
            <          # Opening bracket
            [^<>]*     # Non-bracket characters
            >          # Closing bracket
        )*             # Match the pair 0 or more times
        [^<>]*         # Non-bracket characters 0 or more times
        $              # End of the string (the anchor must be inside the look-ahead)
    )
    

    The regex above assumes that the brackets are always balanced. With unbalanced brackets it might give unexpected results. But then, regex is not really the right tool for that job.
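
To make the balanced-bracket assumption concrete, here is a minimal sketch that applies the look-ahead (with the end anchor inside the group) to the two inputs from the question and to two unbalanced inputs. The class name BalancedLookaheadDemo, the helper replaceInTags, and the unbalanced sample strings are made up for illustration; they are not part of the question.

```java
class BalancedLookaheadDemo {
    // Look-ahead: a closing '>' comes next (ignoring non-bracket characters),
    // then only balanced <...> pairs until the end of the input.
    static final String TAIL = "(?=[^<>]*>(?:[^<>]*<[^<>]*>)*[^<>]*$)";

    // Hypothetical helper: case-insensitive replacement of mainTag inside tags.
    // Assumes mainTag contains no regex metacharacters.
    static String replaceInTags(String mainTag, String replaceTag, String text) {
        return text.replaceAll("(?i)" + mainTag + TAIL, replaceTag);
    }

    public static void main(String[] args) {
        // Inputs from the question.
        System.out.println(replaceInTags("HI", "667", "92<HI=/><z==//HIb><cHIhi> "));
        // prints: 92<667=/><z==//667b><c667667> 
        System.out.println(replaceInTags("aBc", "923", "<dont replacethis>abcabc<abcabcde>"));
        // prints: <dont replacethis>abcabc<923923de>

        // Unbalanced input: a later unclosed '<' prevents the match in a valid tag.
        System.out.println(replaceInTags("HI", "667", "<HI> <oops"));
        // prints: <HI> <oops

        // Unbalanced input: a stray '>' makes text outside any tag match.
        System.out.println(replaceInTags("HI", "667", "HI> <HI>"));
        // prints: 667> <667>
    }
}
```

The last two calls show both failure modes: a tag that should be rewritten is skipped, and text outside any tag is rewritten.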

    Otherwise, a regex as simple as this would also work fine:

    "(?i)" + mainTag + "(?=[^<>]*>)"
    

    Which one to use depends on your use case. This second regex does not check for balanced brackets. Try it first; if it covers all your scenarios, it is the better choice.
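
Since mainTag and replaceTag are variables, it may also be worth escaping them before they reach the regex engine. The sketch below wraps the simpler look-ahead in a helper that uses Pattern.quote and Matcher.quoteReplacement. The class name QuotedTagReplace, the helper name, and the extra sample inputs are assumptions added for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedTagReplace {
    // Hypothetical helper built on the simpler look-ahead "(?=[^<>]*>)".
    static String replaceInTags(String mainTag, String replaceTag, String text) {
        // Pattern.quote treats mainTag as a literal, so '.', '*', etc. are not special.
        Pattern p = Pattern.compile(
                Pattern.quote(mainTag) + "(?=[^<>]*>)",
                Pattern.CASE_INSENSITIVE);
        // Matcher.quoteReplacement keeps '$' and '\' in replaceTag literal.
        return p.matcher(text).replaceAll(Matcher.quoteReplacement(replaceTag));
    }

    public static void main(String[] args) {
        // Inputs from the question.
        System.out.println(replaceInTags("HI", "667", "92<HI=/><z==//HIb><cHIhi> "));
        // prints: 92<667=/><z==//667b><c667667> 
        System.out.println(replaceInTags("aBc", "923", "<dont replacethis>abcabc<abcabcde>"));
        // prints: <dont replacethis>abcabc<923923de>

        // Metacharacters in mainTag are matched literally: "abc" is left alone.
        System.out.println(replaceInTags("a.c", "X", "<abc a.c>"));
        // prints: <abc X>

        // A '$' in replaceTag is inserted as-is instead of being read as a group reference.
        System.out.println(replaceInTags("HI", "$1", "<HI>"));
        // prints: <$1>
    }
}
```

Without the quoting, a mainTag such as "a.c" would also match "abc", and a replaceTag such as "$1" would throw because the pattern has no capturing group 1.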