
Capturing groups and Pattern split method in regular expression


How can I understand the output of the code below? The first four print statements are about capturing groups in Java regular expressions, and the rest of the code is about the Pattern split method. I read a few documents to make sense of the output (shown in the picture) but could not figure out exactly how it works and why it produces this output.

Java Code

    import java.util.*;
    import java.util.regex.*;
    import java.lang.*;
    import java.io.*;

    /* Name of the class has to be "Main" only if the class is public. */
    public class Codechef
    {
        public static void main(String[] args) {
            //Capturing Group in Regular Expression
            System.out.println(Pattern.matches("(\\w\\d)\\1", "a2a2")); //true
            System.out.println(Pattern.matches("(\\w\\d)\\1", "a2b2")); //false
            System.out.println(Pattern.matches("(AB)(B\\d)\\2\\1", "ABB2B2AB")); //true
            System.out.println(Pattern.matches("(AB)(B\\d)\\2\\1", "ABB2B3AB")); //false
            // using pattern split method
            Pattern pattern = Pattern.compile("\\W");
            String[] words = pattern.split("one@two#three:four$five");
            System.out.println(words);
            for (String s : words) {
                System.out.println("Split using Pattern.split(): " + s);
            }

        }
    }

Results

(Screenshot of the program's console output; image not included.)

Edit-1

Queries

  • For capturing groups, I cannot figure out what the ‘\1’ and ‘\2’ are for here. How do these expressions evaluate to true or false?
  • For the Pattern split method, I would like to know how the string is split. How does this split method differ from the normal String split method?

Solution

  • The first console print lines...

    System.out.println(Pattern.matches("(\\w\\d)\\1", "a2a2")); //true
    System.out.println(Pattern.matches("(\\w\\d)\\1", "a2b2")); //false
    System.out.println(Pattern.matches("(AB)(B\\d)\\2\\1", "ABB2B2AB")); //true
    System.out.println(Pattern.matches("(AB)(B\\d)\\2\\1", "ABB2B3AB")); //false
    

    use the matches() method, which always returns a boolean (true or false) and succeeds only if the entire string matches the expression. This method is mostly used for String validation of one sort or another. The first and second examples both use the regular expression "(\\w\\d)\\1". Running that expression against the two supplied strings ("a2a2" and "a2b2") through the matches() method, as they have done, returns boolean true and then false, in that order.

    The real key here is knowing what that particular Regular Expression is supposed to validate. The expression above contains only 1 Capturing Group, which is denoted by the parentheses. The \\w matches any single word character: a-z, A-Z, 0-9 or _ (the underscore character). The \\d matches a single digit from 0 to 9.

    Note: The Meta characters are really written as \w and \d, but because the Escape Character (\) itself needs to be escaped inside a Java String literal, you have to add an additional Escape Character.
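    To make the escaping concrete, here is a small sketch (the class and method names are made up for illustration) that prints the pattern text the regex engine actually receives:

```java
class EscapeDemo {
    // Returns the pattern exactly as written in the Java source of the question.
    static String regexText() {
        return "(\\w\\d)\\1";
    }

    public static void main(String[] args) {
        // Each "\\" in the source is a single backslash at runtime,
        // so this prints: (\w\d)\1
        System.out.println(regexText());
        // The runtime string is 8 characters long, not 11.
        System.out.println(regexText().length());
    }
}
```

    So the doubled backslashes exist only in the source code; the regex engine sees plain \w, \d and \1.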

    The \1 is a backreference: it matches the same text that was most recently matched by the 1st capturing group. Since there is only one capturing group specified, 1 is the only useful value here. A \0 does not refer back to the whole match; in Java it starts an octal escape instead. Any value greater than 1 would refer to a capturing group that doesn't exist, so the expression could never match.

    Bottom line: the expression works through the supplied string like this:

    • Is the first character (\\w) an upper or lower case A to Z, an underscore, or a digit from 0 to 9? If it isn't, there is no match and boolean false is returned. If it is, then...
    • Is the second character (\\d) a digit from 0 to 9? If it isn't, boolean false is returned. If it is, then...
    • Are the remaining 2 characters exactly the same as the first 2 (including letter case if a-z or A-Z is used)? If they differ, or if there are more than two remaining characters, boolean false is returned. If they are identical to the first 2, boolean true is returned.

    Basically, the expression is merely used to validate that the Last Two characters within the supplied String match the First Two characters of the same supplied String. This is why the second console print:

    System.out.println(Pattern.matches("(\\w\\d)\\1", "a2b2")); //false
    

    returns a boolean false: b2 is not the same as a2. In the first console print:

    System.out.println(Pattern.matches("(\\w\\d)\\1", "a2a2")); //true
    

    the Last Two characters a2 do indeed match the First Two characters a2 and therefore boolean true is returned.
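    The check described above can also be observed with a Matcher, which exposes what group 1 captured. This is only a sketch; the class name, the method name and the third test string are assumptions added for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackrefDemo {
    // Same expression as in the question: a word character plus a digit, then \1.
    private static final Pattern REPEATED_PAIR = Pattern.compile("(\\w\\d)\\1");

    // Describes how the pattern fares against the whole input.
    static String explain(String input) {
        Matcher m = REPEATED_PAIR.matcher(input);
        if (m.matches()) {
            // group(1) is the text captured by (\w\d); \1 had to repeat it exactly.
            return input + ": group 1 = " + m.group(1) + ", repeated by \\1";
        }
        return input + ": no match";
    }

    public static void main(String[] args) {
        System.out.println(explain("a2a2"));   // a2a2: group 1 = a2, repeated by \1
        System.out.println(explain("a2b2"));   // a2b2: no match
        System.out.println(explain("a2a2a2")); // a2a2a2: no match (matches() needs the whole input)
    }
}
```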

    You will now notice that in the other two console prints:

    System.out.println(Pattern.matches("(AB)(B\\d)\\2\\1", "ABB2B2AB")); //true
    System.out.println(Pattern.matches("(AB)(B\\d)\\2\\1", "ABB2B3AB")); //false
    

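    the Regular Expression used contains 2 Capture Groups (two sets of parentheses). The same sort of matching applies here, but against two capture groups instead of one: (AB) captures AB as group 1, (B\\d) captures B2 as group 2, and then \\2 must repeat B2 and \\1 must repeat AB. In the second string, B3 follows where \\2 expects B2, so false is returned.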
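    As a rough illustration of the two-group case, the sketch below prints what each group captured. The class and method names are hypothetical; the pattern and the two strings are the ones from the question:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoGroupDemo {
    // Group 1 is (AB), group 2 is (B\d); \2 and \1 must then repeat them in that order.
    private static final Pattern TWO_GROUPS = Pattern.compile("(AB)(B\\d)\\2\\1");

    static String explain(String input) {
        Matcher m = TWO_GROUPS.matcher(input);
        if (!m.matches()) {
            return input + ": no match";
        }
        return input + ": group 1 = " + m.group(1) + ", group 2 = " + m.group(2);
    }

    public static void main(String[] args) {
        // AB + B2 captured, then B2 (\2) and AB (\1) follow, so this matches.
        System.out.println(explain("ABB2B2AB")); // ABB2B2AB: group 1 = AB, group 2 = B2
        // Group 2 captured B2, but B3 follows where \2 expects B2 again.
        System.out.println(explain("ABB2B3AB")); // ABB2B3AB: no match
    }
}
```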

    If you want to see how these Regular Expressions play out and get explanations on what the expressions mean then use Regular Expression Tester at regex101.com. This is also a good Regular Expressions resource.

    Pattern.split():

    In this case, the use of the Pattern.split() method is a little overkill in my opinion, since String.split() accepts Regular Expressions, but it does have its purpose in other areas. Nevertheless, it is a good example of how it can be used. The .split() method is used here to split the String that was supplied to it, using as the delimiter whatever matches the Regular Expression compiled through Pattern, which in this case is "\\W" (otherwise: \W). The \W (uppercase W) means 'match any non-word character', that is, any character which is not a-z, A-Z, 0-9 or _. This expression is basically the opposite of "\w" (with the lowercase w). The characters @, #, :, and $ contained within the supplied String (and yes, the comma, semicolon, exclamation mark, etc. would count too):

    "one@two#three:four$five"
    

    are considered non-word characters and therefore the split is carried out on any one of them resulting in a String Array containing:

    [one, two, three, four, five]
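    A minimal sketch of that split follows (class and method names are made up for illustration). It also shows why the System.out.println(words) line in the question does not print the words: an array passed to println is not rendered element by element, so Arrays.toString() is used here instead. The "one@@two" input is an added example, not part of the question.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitDemo {
    // Splits the input on every single non-word character.
    static String[] splitOnNonWord(String input) {
        return Pattern.compile("\\W").split(input);
    }

    public static void main(String[] args) {
        String[] words = splitOnNonWord("one@two#three:four$five");

        // Printing the array itself shows its type and identity hash,
        // i.e. text starting with "[Ljava.lang.String;@", not the contents.
        System.out.println(words);

        // Arrays.toString() renders the elements.
        System.out.println(Arrays.toString(words)); // [one, two, three, four, five]

        // Two delimiters in a row leave an empty string between them.
        System.out.println(Arrays.toString(splitOnNonWord("one@@two"))); // [one, , two]
    }
}
```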
    

    The very same thing can be accomplished using the String.split() method, since this method also accepts a Regular Expression:

    String[] s = "one@two#three:four$five".split("\\W");
    

    or even:

    String[] s = "one@two#three:four$five".split("[@#:$]");
    

    or even:

    String[] s = "one@two#three:four$five".split("@|#|:|\\$");
    // The $ character is a reserved RegEx symbol and therefore
    // needs to be escaped.
    

    or on and on and on...

    Yup... "\\W" is easier since it covers all non-word characters. ;)
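    On the question of how Pattern.split() differs from String.split(): for the same regex both give the same result, so the practical difference is that a Pattern can be compiled once and reused for many inputs, whereas String.split() takes the regex as a string on every call. The sketch below illustrates this under that assumption; the class, constant and method names are hypothetical, and the limit argument of 3 is just an example value.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class ReuseDemo {
    // Compiled once; can be reused for any number of inputs.
    private static final Pattern NON_WORD = Pattern.compile("\\W");

    // True if String.split() and the precompiled Pattern agree on this input.
    static boolean sameAsStringSplit(String input) {
        return Arrays.equals(input.split("\\W"), NON_WORD.split(input));
    }

    // Splits into at most 'limit' pieces; the last piece keeps the unsplit rest.
    static String[] splitWithLimit(String input, int limit) {
        return NON_WORD.split(input, limit);
    }

    public static void main(String[] args) {
        String text = "one@two#three:four$five";
        System.out.println(sameAsStringSplit(text)); // true
        System.out.println(Arrays.toString(splitWithLimit(text, 3))); // [one, two, three:four$five]
    }
}
```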