
Java regex: split a string while keeping the delimiters


So, I have a simple string that looks like this:

word1 word2! word3? word4; word5, word6
word7 //new line
!word8! word9 word10 word11 word12

I want to split this string while keeping the whitespace and newline delimiters. Right now I'm calling s.split() with the [\\s\\r\\n] expression as its argument, and the output is:

[word1, word2!, word3?, word4;, word5,, word6, , word7, , !word8!, word9, word10, word11, word12]

I'm okay with the spaces not being saved. But what can I do about a \n being treated as if it were just another whitespace character?

UPD: I pass this string through a RabbitMQ queue. As a Java string literal it looks like this:

"word1 word2! word3? word4; word5, word6\nword7\n!word8! word9 word10 word11 word12"

Solution

  • Instead of splitting, you can match: the \S+|\s+ regex tokenizes the text into alternating non-whitespace and whitespace chunks, so every delimiter (including each \n) is kept as its own token.

    See the Java demo:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    class Ideone
    {
        public static void main (String[] args)
        {
            String line = "word1 word2! word3? word4; word5, word6\nword7\n!word8! word9 word10 word11 word12";
            // \S+ matches a run of non-whitespace, \s+ a run of whitespace (spaces, \n, ...)
            Pattern p = Pattern.compile("\\S+|\\s+");
            Matcher m = p.matcher(line);
            List<String> res = new ArrayList<>();
            // Collect every match in order, so no character of the input is dropped
            while (m.find()) {
                res.add(m.group());
            }
            System.out.println(res);
        }
    }
    

    Output:

    [word1,  , word2!,  , word3?,  , word4;,  , word5,,  , word6, 
    , word7, 
    , !word8!,  , word9,  , word10,  , word11,  , word12]
    

    The line breaks in the output are literal newline characters, each kept as its own token.
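
Since the question says that dropping plain spaces is fine and only the line breaks need to survive, a narrower variant may fit better. The sketch below is an assumption about what is wanted, not part of the original answer: it matches \S+|\R, where \R is the line-break matcher available since Java 8, so spaces and tabs match neither alternative and are skipped. The class name KeepNewlines and the helper wordsAndNewlines are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeepNewlines {
    // \S+ matches a word; \R matches one line break (\r\n counts as a single unit).
    // Spaces and tabs match neither alternative, so find() steps over them.
    private static final Pattern TOKEN = Pattern.compile("\\S+|\\R");

    // Hypothetical helper: returns words and line breaks in input order,
    // dropping all other whitespace.
    static List<String> wordsAndNewlines(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String line = "word1 word2! word3? word4; word5, word6\nword7\n!word8! word9 word10 word11 word12";
        // Escape the line breaks so they are visible in the printed list.
        List<String> shown = new ArrayList<>();
        for (String t : wordsAndNewlines(line)) {
            shown.add(t.replace("\n", "\\n"));
        }
        System.out.println(shown);
        // prints [word1, word2!, word3?, word4;, word5,, word6, \n, word7, \n, !word8!, word9, word10, word11, word12]
    }
}
```

If the input may contain only \n line breaks, \n can be used in place of \R; \R is used here so that \r\n arriving from another system is kept as one token rather than split in two.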