Tags: java, regex, pattern-matching, java.util.scanner, java-9

Scanner.findAll() and Matcher.results() behave differently for the same input text and pattern


I ran into something odd while splitting a properties string with a regex, and I cannot find the root cause.

I have a string that contains properties-style key=value pairs. I have a regex that splits the string into keys and values based on the position of the =. The first = on a line is the split point; the value itself can also contain =.

I tried two different ways of doing this in Java.

  1. Using the Scanner.findAll() method

    This does not behave as expected. It should extract and print all keys that match the pattern, but it behaves oddly for one key-value pair:

    SectionError.ErrorMessage=errorlevel=Warning {HelpMessage:This is very important message This is very important .....}

    The key that should be extracted is SectionError.ErrorMessage=, but errorlevel= is also reported as a key.

    Interestingly, if I remove a single character from the properties string, it behaves correctly and only extracts SectionError.ErrorMessage= for that line.

  2. Using the Matcher.results() method

    This works fine, whatever we put in the properties string.

Sample code I tried:

import java.util.Scanner;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;

import static java.util.regex.Pattern.MULTILINE;

public class MessageSplitTest {

    static final Pattern pattern = Pattern.compile("^[a-zA-Z0-9._]+=", MULTILINE);

    public static void main(String[] args) {
        final String properties =
                "SectionOne.KeyOne=first value\n" + // removing one char from here would make the scanner method print expected keys
                        "SectionOne.KeyTwo=second value\n" +
                        "SectionTwo.UUIDOne=379d827d-cf54-4a41-a3f7-1ca71568a0fa\n" +
                        "SectionTwo.UUIDTwo=384eef1f-b579-4913-a40c-2ba22c96edf0\n" +
                        "SectionTwo.UUIDThree=c10f1bb7-d984-422f-81ef-254023e32e5c\n" +
                        "SectionTwo.KeyFive=hello-world-sample\n" +
                        "SectionThree.KeyOne=first value\n" +
                        "SectionThree.KeyTwo=second value additional text just to increase the length of the text in this value still not enough adding more strings here n there\n" +
                        "SectionError.ErrorMessage=errorlevel=Warning {HelpMessage:This is very important message This is very important message This is very important messageThis is very important message This is very important message This is very important message This is very important message This is very important message This is very important message This is very important message This is very important messageThis is very important message This is very important message This is very important message This is very important message This is very important message}\n" +
                        "SectionFour.KeyOne=sixth value\n" +
                        "SectionLast.KeyOne=Country";

        printKeyValuesFromPropertiesUsingScanner(properties);
        System.out.println();
        printKeyValuesFromPropertiesUsingMatcher(properties);
    }

    private static void printKeyValuesFromPropertiesUsingScanner(String properties) {
        System.out.println("===Using Scanner===");
        try (Scanner scanner = new Scanner(properties)) {
            scanner
                    .findAll(pattern)
                    .map(MatchResult::group)
                    .forEach(System.out::println);
        }
    }

    private static void printKeyValuesFromPropertiesUsingMatcher(String properties) {
        System.out.println("===Using Matcher===");
        pattern.matcher(properties).results()
                .map(MatchResult::group)
                .forEach(System.out::println);

    }
}

Output printed:

===Using Scanner===
SectionOne.KeyOne=
SectionOne.KeyTwo=
SectionTwo.UUIDOne=
SectionTwo.UUIDTwo=
SectionTwo.UUIDThree=
SectionTwo.KeyFive=
SectionThree.KeyOne=
SectionThree.KeyTwo=
SectionError.ErrorMessage=
errorlevel=
SectionFour.KeyOne=
SectionLast.KeyOne=

===Using Matcher===
SectionOne.KeyOne=
SectionOne.KeyTwo=
SectionTwo.UUIDOne=
SectionTwo.UUIDTwo=
SectionTwo.UUIDThree=
SectionTwo.KeyFive=
SectionThree.KeyOne=
SectionThree.KeyTwo=
SectionError.ErrorMessage=
SectionFour.KeyOne=
SectionLast.KeyOne=

What could be the root cause of this? Does Scanner's findAll() work differently from Matcher's results()?

Please let me know if any more info is required.


Solution

  • Scanner's documentation mentions the word "buffer" a lot. This suggests that a Scanner does not know about the entire string it is reading from, and only holds a small part of it in a buffer at any one time. This makes sense, because Scanners are designed to read from streams as well: reading everything from a stream might take a long time (or forever!) and use a lot of memory.

    In the source code of Scanner, there is indeed a CharBuffer:

    // Internal buffer used to hold input
    private CharBuffer buf;
    

    Because of the length and contents of your string, the Scanner has decided to load everything up to...

    SectionError.ErrorMessage=errorlevel=Warning {HelpMessage:This is very...
                              ^
                        somewhere here
    (It could be anywhere in the word "errorlevel")
    

    ...into the buffer. Then, after that half of the string is read, the other half of the string starts like this:

    errorlevel=Warning {HelpMessage:This is very...
    

    errorlevel= is now at the start of the text the Scanner sees, so the ^ anchor matches there and the pattern reports it as a key.
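The effect of losing the text before "errorlevel" can be reproduced without a Scanner. The following is a minimal sketch, not the Scanner's actual internals: the class name TruncatedInputDemo, the helper keys, and the cut point are assumptions chosen for illustration, and the sample text is shortened. It runs the question's pattern once over the full text and once over the remainder that begins at "errorlevel".

```java
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class TruncatedInputDemo {

    // Same pattern as in the question.
    static final Pattern PATTERN = Pattern.compile("^[a-zA-Z0-9._]+=", Pattern.MULTILINE);

    // Returns every key the pattern finds in the given text.
    static List<String> keys(String text) {
        return PATTERN.matcher(text).results()
                .map(MatchResult::group)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String full = "SectionError.ErrorMessage=errorlevel=Warning {HelpMessage:text}\n"
                + "SectionFour.KeyOne=sixth value";

        // Hypothetical cut point: pretend everything before "errorlevel"
        // has already been consumed and discarded.
        String remainder = full.substring(full.indexOf("errorlevel"));

        System.out.println(keys(full));      // [SectionError.ErrorMessage=, SectionFour.KeyOne=]
        System.out.println(keys(remainder)); // [errorlevel=, SectionFour.KeyOne=]
    }
}
```

With the full text, errorlevel= is in the middle of a line, so ^ cannot match there. Once the text starts at errorlevel=, nothing precedes it and ^ matches.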

    Related Bug?

    Matcher doesn't use a buffer. It stores the whole string it is matching against in its text field:

    /**
     * The original string being matched.
     */
    CharSequence text;
    

    So this behaviour is not observed in Matcher.
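If the input has to come from a stream rather than a String, one way to avoid depending on where the buffer happens to end is to read whole lines and match each line on its own. This is a minimal sketch under the assumption that every property sits on a single line; the class name LineByLineKeys and the method keys are made up for illustration and are not part of the question's code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LineByLineKeys {

    // Same character class as the question's pattern. No ^ or MULTILINE is needed,
    // because lookingAt() always starts matching at the beginning of the line.
    static final Pattern KEY = Pattern.compile("[a-zA-Z0-9._]+=");

    // Reads the source line by line and collects the key (up to and including
    // the first '=') of every line that starts with a valid key.
    static List<String> keys(Reader source) {
        List<String> keys = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                Matcher m = KEY.matcher(line);
                if (m.lookingAt()) {
                    keys.add(m.group());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return keys;
    }

    public static void main(String[] args) {
        String properties = "SectionError.ErrorMessage=errorlevel=Warning {HelpMessage:text}\n"
                + "SectionFour.KeyOne=sixth value";
        // Prints SectionError.ErrorMessage= and SectionFour.KeyOne=, one per line.
        keys(new StringReader(properties)).forEach(System.out::println);
    }
}
```

Because each line is matched as a complete String, the match never depends on how much of the input has been buffered. When the whole input is already a String, as in the question, Matcher.results() remains the simpler choice.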