
Java regex replaceAll with exclude pattern


I'm trying to make search keywords bold in result titles by wrapping each keyword in <b>kw</b> using the replaceAll() method. Any special characters in the keywords should be ignored for highlighting. My current code (below) has a problem: on the second pass it also replaces inside the <b> tags inserted by the first pass. I'm looking for an elegant regex solution, since my alternative keeps growing without covering all cases. For example, with this input:

addHighLight("a b", "abacus") 

...I get this result:

<<b>b</b>>a</<b>b</b>><b>b</b><<b>b</b>>a</<b>b</b>>cus

public static String addHighLight(String kw, String text) {
    String highlighted = text;
    if (kw != null && !kw.trim().isEmpty()) {
        List<String> tokens = Arrays.asList(kw.split("[^\\p{L}\\p{N}]+"));
        for(String token: tokens) {
            try {
                highlighted = highlighted.replaceAll("(?i)(" + token + ")", "<b>$1</b>");
            } catch ( Exception e) {
                e.printStackTrace();
            }
        }
    }
    return highlighted;
}

Solution

    1. Remember to wrap each token in Pattern.quote(token), unless the keywords are guaranteed to contain no regex metacharacters.
    2. If you're bound to use replaceAll() (instead of tokenizing the input into tag|text|tag|text|... and applying the replacement to the text parts only, which would be much simpler and faster), the cure() helper and the replaceAll() line given further down should help.
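
    For comparison, the tokenizing alternative mentioned in point 2 could look roughly like the sketch below. It is only an illustration under stated assumptions: the class name TokenizingHighlighter and the helpers highlightTextOnly and wrap are hypothetical, and a "tag" is assumed to be anything matching <[^>]*>, so CDATA sections and a literal '<' in text are not handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: highlight tokens only in the text between tags.
final class TokenizingHighlighter {
    // Assumption: a tag is '<', any run of non-'>' characters, then '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    public static String addHighLight(String kw, String text) {
        if (kw == null || text == null || kw.trim().isEmpty()) {
            return text;
        }
        String highlighted = text;
        for (String token : kw.split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // split() yields "" when kw starts with a separator
            }
            Pattern p = Pattern.compile(Pattern.quote(token),
                    Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
            highlighted = highlightTextOnly(highlighted, p);
        }
        return highlighted;
    }

    // Copies tags through unchanged and highlights only the text between them.
    private static String highlightTextOnly(String input, Pattern token) {
        StringBuilder out = new StringBuilder();
        Matcher tags = TAG.matcher(input);
        int pos = 0;
        while (tags.find()) {
            out.append(wrap(input.substring(pos, tags.start()), token));
            out.append(tags.group());
            pos = tags.end();
        }
        out.append(wrap(input.substring(pos), token));
        return out.toString();
    }

    // $0 is the whole match, so the original letter case is preserved.
    private static String wrap(String text, Pattern token) {
        return token.matcher(text).replaceAll("<b>$0</b>");
    }
}
```

    With this sketch, addHighLight("a b", "abacus") gives <b>a</b><b>b</b><b>a</b>cus, because the second pass never looks inside the <b> tags produced by the first. One known limitation: a later token can still match inside text that an earlier token already wrapped (for example kw "ab a"), which yields nested <b> elements.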

    Note that the replaceAll() approach is not efficient: it also matches some empty or already-highlighted spots and therefore needs "curing" after each substitution. It should, however, treat XML/HTML tags (except CDATA) properly.

    Here's a "curing" function (no null checks):

    private static Pattern cureDoubleB = Pattern.compile("<b><b>([^<>]*)</b></b>");
    private static Pattern cureEmptyB = Pattern.compile("<b></b>");
    private static String cure(String input) {
        return cureEmptyB.matcher(cureDoubleB.matcher(input).replaceAll("<b>$1</b>")).replaceAll("");
    }
    

    Here's how the replaceAll line should look:

    String txt = "[^<>" + Pattern.quote(token.substring(0, 1).toLowerCase()) + Pattern.quote(token.substring(0, 1).toUpperCase()) +"]*";
    highlighted = cure(highlighted.replaceAll("((<[^>]*>)*"+txt+")(((?i)" + Pattern.quote(token) + ")|("+txt+"))", "$1<b>$4</b>$5"));
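
    Put together, the two fragments above might be assembled as in the following sketch. The class name RegexHighlighter, the constant names, and the null/empty-token guards are additions for illustration and not part of the original answer; the regex and the replacement string are taken unchanged from the lines above.

```java
import java.util.regex.Pattern;

// Hypothetical assembly of the answer's cure() helper and replaceAll() line.
final class RegexHighlighter {
    private static final Pattern CURE_DOUBLE_B = Pattern.compile("<b><b>([^<>]*)</b></b>");
    private static final Pattern CURE_EMPTY_B = Pattern.compile("<b></b>");

    // Collapses <b><b>x</b></b> into <b>x</b>, then drops empty <b></b> pairs.
    private static String cure(String input) {
        String single = CURE_DOUBLE_B.matcher(input).replaceAll("<b>$1</b>");
        return CURE_EMPTY_B.matcher(single).replaceAll("");
    }

    public static String addHighLight(String kw, String text) {
        if (kw == null || text == null || kw.trim().isEmpty()) {
            return text;
        }
        String highlighted = text;
        for (String token : kw.split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // added guard: split() yields "" for a leading separator
            }
            String first = token.substring(0, 1);
            // Text run that contains no tag delimiters and cannot start a match.
            String txt = "[^<>" + Pattern.quote(first.toLowerCase())
                    + Pattern.quote(first.toUpperCase()) + "]*";
            highlighted = cure(highlighted.replaceAll(
                    "((<[^>]*>)*" + txt + ")(((?i)" + Pattern.quote(token) + ")|(" + txt + "))",
                    "$1<b>$4</b>$5"));
        }
        return highlighted;
    }
}
```

    For the input from the question, RegexHighlighter.addHighLight("a b", "abacus") returns <b>a</b><b>b</b><b>a</b>cus: the optional (<[^>]*>)* prefix consumes whole tags, so the "b" token is never matched inside an existing <b> or </b>, and cure() removes the empty <b></b> pairs left behind by the second alternative.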