java regex ocpjp

How does String.split("\\S") work?


I was doing a question out of the book Oracle Certified Professional Java SE 7 Programmer Exams 1Z0-804 and 1Z0-805 by Ganesh and Sharma.

One question is:

  1. Consider the following program and predict the output:

      class Test {
    
        public static void main(String args[]) {
          String test = "I am preparing for OCPJP";
          String[] tokens = test.split("\\S");
          System.out.println(tokens.length);
        }
      }
    

    a) 0

    b) 5

    c) 12

    d) 16

Now I understand that \S is a regex that means "treat non-whitespace chars as the delimiters". But I am puzzled as to how the regex does its matching and what tokens split actually produces.

I added code to print out the tokens as follows:

for (String str: tokens){
  System.out.println("<" + str + ">");
}

and I got the following output:

16
<>
< >
<>
< >
<>
<>
<>
<>
<>
<>
<>
<>
< >
<>
<>
< >

So a lot of empty string tokens. I just do not understand this.

I would have thought that if the delimiters are non-space chars, then every alphabetic char in the above text serves as a delimiter, so maybe there should be 21 tokens if we also count tokens that are empty strings. I just don't understand how Java's regex engine is working this out. Are there any regex gurus out there who can shed light on this code for me?


Solution

  • First things first: start with \s (lower case), which is a regular expression character class for white space. By default in Java that is the space ' ', tab '\t', the new line chars '\n' and '\r', vertical tab '\x0B' and form feed '\f'.

    \S (upper case) is the opposite of this, so that would mean any non white space character.

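    So when you split the String "I am preparing for OCPJP" using \S you are effectively splitting it at every letter. There are 20 letters, which gives 21 pieces, but split(regex) discards trailing empty strings: the five empty pieces at the end (between and after the letters of OCPJP) are dropped. That is the reason your token array has a length of 16.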

    Now as for why these are empty.

    Consider the following String: "Hello,World". If we were to split that using ",", we would end up with a String array of length 2, with the following contents: Hello and World. Notice that the , is not in either of the Strings; it has been erased.

    The same thing has happened with the "I am preparing for OCPJP" String: it has been split, and the characters matched by your regex are not in any of the returned values. Because most of the letters in that String are followed by another letter, you end up with a load of Strings of length zero; only the white space characters are preserved.
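
To see where the 16 and the 21 come from, here is a minimal sketch (the class name SplitDemo and its two helper methods are made up for illustration, not taken from the book). It relies on the documented behaviour that the one-argument split drops trailing empty strings, while a negative limit keeps them:

```java
import java.util.Arrays;

public class SplitDemo {

    // One-argument split: trailing empty strings are removed from the result.
    public static String[] splitDroppingTrailing(String text) {
        return text.split("\\S");
    }

    // Negative limit: the pattern is applied as often as possible and
    // trailing empty strings are kept.
    public static String[] splitKeepingTrailing(String text) {
        return text.split("\\S", -1);
    }

    public static void main(String[] args) {
        String test = "I am preparing for OCPJP";

        // The delimiter itself never appears in the tokens.
        System.out.println(Arrays.toString("Hello,World".split(","))); // [Hello, World]

        // 20 non-whitespace chars act as delimiters, so there are 21 pieces.
        System.out.println(splitKeepingTrailing(test).length);  // 21

        // The 5 empty pieces at the end (around O, C, P, J, P) are dropped.
        System.out.println(splitDroppingTrailing(test).length); // 16
    }
}
```

So the guess of 21 in the question is what you get with test.split("\\S", -1); the plain test.split("\\S") stops at the last non-empty token, which is the " " before OCPJP.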