java regex debugging limit

Regex line limit is off by one for special characters (e.g. +-.,!@#$%^&*;)


I am using a regex to print a string with a newline inserted once a character limit is reached. A word that would cross the limit should move to the next line rather than be split, unless a single run of concatenated characters is itself longer than the limit, in which case the rest of it simply continues on the next line. However, when the text contains special characters (e.g. +-.,!@#$%^&*;), as the test below shows, each line ends up one character longer than the limit. Why is this?

My function is:

public static String limiter(String str, int lim) {
    str = str.trim().replaceAll(" +", " ");
    str = str.replaceAll("\n +", "\n");
    Matcher mtr = Pattern.compile("(.{1," + lim + "}(\\W|$))|(.{0," + lim + "})").matcher(str);
    String newStr = "";
    int ctr = 0;
    while (mtr.find()) {
        if (ctr == 0) {
            newStr += (mtr.group());
            ctr++;
        } else {
            newStr += ("\n") + (mtr.group());
        }
    }
    return newStr;
}

So my input is: String str = " The 123456789 456789 +-.,!@#$%^&*();\\/|<>\"\' fox jumpeded over the uf\n 2 3456 green fence ";

With a character line limit of 7.

It outputs:

456789 +
-.,!@#$%
^&*();\/
|<>"

When the correct output should be:

456789
+-.,!@#
$%^&*()
;\/|<>"

My code is linked to an online compiler you can run here: https://ideone.com/9gckP1


Solution

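  • The extra character comes from (\W|$): after .{1,lim} has matched up to lim characters, \W consumes one more non-word character, so a line can end up lim + 1 characters long. Replace (\W|$) with \b, since your intention is to break at word boundaries and \b is a zero-width assertion that consumes nothing. Also add \s* after it, so that the whitespace following a word is consumed by the current match and the next line does not begin with a space.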

    So, use

    Matcher mtr = Pattern.compile("(?U)(.{1," + lim + "}\\b\\s*)|(.{0," + lim + "})").matcher(str);
    


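    Note that (?U), the embedded form of Pattern.UNICODE_CHARACTER_CLASS, is used here to keep the word boundary \b in sync with \w, so that both apply the same Unicode-aware definition of a word character. This mainly matters for non-ASCII text such as letters with diacritics.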
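
To compare the two patterns side by side, here is a minimal sketch. The class name LimiterDemo and the helpers lines, original and fixed are made up for this illustration and are not part of the question's code. As an extra assumption, the helper also strips trailing whitespace from each match and skips the empty match that .{0,lim} produces at the end of the input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LimiterDemo {

    // Hypothetical helper: collects every match of the regex as one line.
    // Trailing whitespace is stripped and empty matches are skipped
    // (both are choices made for this sketch, not part of the original code).
    static List<String> lines(String regex, String str) {
        Matcher m = Pattern.compile(regex).matcher(str);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            String line = m.group().replaceAll("\\s+$", "");
            if (!line.isEmpty()) {
                out.add(line);
            }
        }
        return out;
    }

    // Pattern from the question: (\W|$) consumes one extra character.
    static String original(int lim) {
        return "(.{1," + lim + "}(\\W|$))|(.{0," + lim + "})";
    }

    // Pattern from the answer: \b is zero-width, \s* swallows the separator.
    static String fixed(int lim) {
        return "(?U)(.{1," + lim + "}\\b\\s*)|(.{0," + lim + "})";
    }

    public static void main(String[] args) {
        // Only the part of the question's input that shows the problem.
        String str = "456789 +-.,!@#$%^&*();\\/|<>\"";

        System.out.println(String.join("\n", lines(original(7), str)));
        System.out.println("---");
        System.out.println(String.join("\n", lines(fixed(7), str)));
    }
}
```

With a limit of 7, the first group of lines should reproduce the 8-character lines from the question, and the second group should stay within 7 characters per line.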