System.out.print removes Matcher class exception


I was using Jsoup to crawl some web pages when this weird issue came up.

I build the regex patterns like this:

// Sets the prefix for all pages to prevent navigating to unwanted pages.
String prefix = "https://handbook.unimelb.edu.au/%d/subjects";
// Postfix for search page
String searchPostfix = "(\\?page=\\d+)?$";
// Postfix for subject page
String subjectPostfix = "\\/(\\w+)(\\/.+)?$";

String root = String.format(prefix, 2019); // %d requires an integer argument, not a String
String pattern = root.replace("/", "\\/").replace(".", "\\.");
Pattern reg1 = Pattern.compile("^" + pattern + searchPostfix);
Pattern reg2 = Pattern.compile("^" + pattern + subjectPostfix);

With these patterns in place, I ran the code against this string:

String s1 = "https://handbook.unimelb.edu.au/2019/subjects/undergraduate";

And with this method:

private String getSubjectCode(String link) {
    System.out.println(link);
    if (isSubjectPage(link)) {
        Matcher subjectMatcher = subjectPattern.matcher(link);
        System.out.println(link);
        // System.out.println(subjectMatcher.matches());   // Exception if this line is commented out
        System.out.println(subjectMatcher.group(0));
        System.out.println(subjectMatcher.group(1));


        return subjectMatcher.group(1);
    }
    return null;
}

If I leave the commented-out line uncommented, the program runs fine.

However, if I comment that line out:

Exception in thread "main" java.lang.IllegalStateException: No match found
    at java.base/java.util.regex.Matcher.group(Matcher.java:645)
    at Page.Pages.getSubjectCode(Pages.java:54)
    at Page.Pages.enqueue(Pages.java:85)
    at Crawler.Crawler.parsePage(Crawler.java:41)
    at Crawler.Crawler.crawl(Crawler.java:51)
    at Main.main(Main.java:9)

The above exception is thrown. Why would a print statement affect how the program runs?

Also, with the line left uncommented, it prints true:

System.out.println(subjectMatcher.matches());   // Exception if commented
// out -> true

Solution

  • It is not System.out.println that causes the difference, but a side-effect of calling the method matches().

    This is explained in the Javadoc of Matcher:

    A matcher is created from a pattern by invoking the pattern's matcher method. Once created, a matcher can be used to perform three different kinds of match operations:

    • The matches method attempts to match the entire input sequence against the pattern.
    • The lookingAt method attempts to match the input sequence, starting at the beginning, against the pattern.
    • The find method scans the input sequence looking for the next subsequence that matches the pattern.

    And

    The explicit state of a matcher is initially undefined; attempting to query any part of it before a successful match will cause an IllegalStateException to be thrown. The explicit state of a matcher is recomputed by every match operation.

    You need to call matches, lookingAt, or find, and that call must succeed (return true), before you can query the matcher with methods such as group(0).
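
As a minimal sketch of that advice, the method from the question could check the result of matches() before touching any group. The class name SubjectCodeDemo, the helper groupThrowsBeforeMatch, and the use of Pattern.quote in place of the manual escaping are assumptions made for this illustration, not part of the original code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SubjectCodeDemo {
    // Same shape as the subject-page pattern in the question. Pattern.quote
    // replaces the manual escaping of "/" and "." (an assumption for this sketch).
    private static final Pattern SUBJECT_PATTERN = Pattern.compile(
            "^" + Pattern.quote("https://handbook.unimelb.edu.au/2019/subjects")
                + "/(\\w+)(/.+)?$");

    /** Returns the subject code, or null when the link is not a subject page. */
    static String getSubjectCode(String link) {
        Matcher m = SUBJECT_PATTERN.matcher(link);
        // matches() runs the match and records the groups as a side effect,
        // so group(1) is only queried after it has returned true.
        if (m.matches()) {
            return m.group(1);
        }
        return null;
    }

    /** Hypothetical helper: true if group() throws on a matcher that has not run a match yet. */
    static boolean groupThrowsBeforeMatch(String link) {
        Matcher m = SUBJECT_PATTERN.matcher(link);
        try {
            m.group(1); // no matches(), lookingAt() or find() call has been made
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String link = "https://handbook.unimelb.edu.au/2019/subjects/undergraduate";
        System.out.println(getSubjectCode(link));                   // undergraduate
        System.out.println(groupThrowsBeforeMatch(link));           // true
        System.out.println(getSubjectCode("https://example.com/")); // null
    }
}
```

Using the boolean returned by matches() as the guard also makes a separate isSubjectPage check unnecessary, since a failed match simply falls through to return null.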