Java is ignoring regex to remove duplicate lines using BlueJ


Really green here. I am trying to get a regex that works in Notepad++ to run in Java using BlueJ, but Java seems to be ignoring it. My other replaceAll calls that use regular expressions all work.

I have this, but it is telling me the \s is an illegal escape character:

    itemList[i] = itemList[i].replaceAll("^(\s*\r\n){2,}", "\r\n");

I read about the Java engine and changed the \s to \\s so it wasn't illegal:

    itemList[i] = itemList[i].replaceAll("^(\\s*\r\n){2,}", "\r\n");

I also tried [[:space:]] instead, but it still doesn't perform the replacement:

    itemList[i] = itemList[i].replaceAll("^([[:space:]]*\r\n){2,}", "\r\n");

This Java tool is processing hundreds of lines, and people are having issues using Notepad++ to remove the duplicate lines. I thought maybe doing it in the formatting tool would eliminate the issues. Here is an example of the text:

1.  Modification: No Error Message When SQL Server Down 

              S9# 395 


              Summary 

              No error message when the SQL Server is 
              down. 

              Workaround 

              There is currently no 
              workaround for this issue. The system will become 
              unusable if SQL server is down.

Solution

  • You need to use multiline mode so that ^ can match at the beginning of any line; otherwise it matches only at the beginning of the whole string. Most text editors enable multiline mode by default, but anywhere else you have to request it explicitly. Just add (?m) to the beginning of the regex:

    (?m)^(\\s*\r\n){2,}
    

    If you're running Java 8 or later (where \h and \R are available), I recommend doing this instead:

    replaceAll("(?m)^(?:\\h*(\\R)){2,}", "$1")
    

    \s* is ambiguous, because it can match newlines as well as spaces; \h only matches horizontal whitespace (e.g., spaces and tabs).

    \R matches any kind of newline: \r\n, \n, \r, or several other, less common ones. The inner group, (\R), captures the last of the redundant newlines, and "$1" plugs it back in. This way, you don't get any nasty surprises if someone changes the newline format of your documents.
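
To see the three variants side by side, here is a minimal sketch rather than a drop-in fix. The class name BlankLineCollapser, its method names, and the sample string are made up for illustration; only the regexes come from the question and the answer above. It assumes the whole multi-line text is held in a single String.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative, not from the original tool.
final class BlankLineCollapser {

    // Regex from the answer: (?m) lets ^ match at every line start,
    // \h* skips spaces/tabs, and (\R) captures the last line break seen.
    private static final Pattern BLANK_RUN =
            Pattern.compile("(?m)^(?:\\h*(\\R)){2,}");

    // The question's attempt: without (?m), ^ matches only at the start of the input.
    static String withoutMultiline(String text) {
        return text.replaceAll("^(\\s*\r\n){2,}", "\r\n");
    }

    // Same regex with (?m) added; still tied to \r\n line endings.
    static String withMultiline(String text) {
        return text.replaceAll("(?m)^(\\s*\r\n){2,}", "\r\n");
    }

    // Java 8+ variant: works for any line-break style and reinserts the one it found.
    static String withLinebreakMatcher(String text) {
        return BLANK_RUN.matcher(text).replaceAll("$1");
    }

    // Makes line breaks visible when printing.
    private static String visible(String s) {
        return s.replace("\r", "\\r").replace("\n", "\\n");
    }

    public static void main(String[] args) {
        // Assumed sample: two blank lines after the first line, one after the second.
        String sample = "S9# 395 \r\n\r\n\r\nSummary \r\n\r\nNo error message";
        System.out.println(visible(withoutMultiline(sample)));
        System.out.println(visible(withMultiline(sample)));
        System.out.println(visible(withLinebreakMatcher(sample)));
    }
}
```

Each run of two or more blank lines collapses to a single blank line, and single blank lines are left alone. Compiling the pattern once is only a convenience for repeated calls; String.replaceAll with the same regex behaves the same way.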