
How can I stop Java's Matcher.group(int) from also returning nested groups inside an outer pair of parentheses?


I have a string like

String str = "美国临时申请No.62004615";

And a regex like

String regex = "(((美国|PCT|加拿大){0,1})([\\u4E00-\\u9FA5]{1,8})((NO.|NOS.){1})([\\d]{5,}))";

The rest of the code is:

    Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
    Matcher matcher = pattern.matcher(str);
    while (matcher.find()) {
        System.out.println("1:"+matcher.group(1)+"\n"
                +"2:"+matcher.group(2)+"\n"
                +"3:"+matcher.group(3)+"\n"
                +"4:"+matcher.group(4)+"\n"
                +"5:"+matcher.group(5)+"\n"
                +"6:"+matcher.group(6)+"\n"
                +"7:"+matcher.group(7));
    }

I know that parentheses () group parts of a regex, and that group 1 is the outermost group covering the whole match.

I expected the second group to be ((美国|PCT|加拿大){0,1}), matching "美国", "PCT", or "加拿大".

I expected the third group to be ([\u4E00-\u9FA5]{1,8}), matching one to eight Chinese characters.

I expected the fourth group to be ((NO.|NOS.){1}), matching NO. or NOS.

I expected the fifth group to be ([\d]{5,}), matching a number with five or more digits.

But the console output is:

    1:美国临时申请No.62004615
    2:美国
    3:美国
    4:临时申请
    5:No.
    6:No.
    7:62004615

Group 2 is the same as group 3, and group 5 is the same as group 6.

It seems that the parentheses nested inside group 2 are captured again as group 3, and likewise for groups 5 and 6. Is there a way to capture only the outermost parentheses?

The ideal result should be

    1:美国临时申请No.62004615
    2:美国
    3:临时申请
    4:No.
    5:62004615

Solution

  • It sounds like you want a non-capturing group. From the Pattern documentation:

    (?:X)        X, as a non-capturing group

    So, change this:

    (美国|PCT|加拿大)
    

    to this:

    (?:美国|PCT|加拿大)
    

    … and then it will not be represented as a group at all in the Matcher. The same change applies to the other nested group, (NO.|NOS.), which becomes (?:NO.|NOS.).

    Some side notes:

    • {0,1} is the same as writing ?.
    • {1} does nothing and can be removed entirely.
    • [\\d] is the same as just \\d.
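
To make the effect of the answer concrete, here is a minimal sketch that runs the question's input against three variants of the pattern. The class name NonCapturingGroupDemo, the constant names, and the groups helper are made up for illustration. The NON_CAPTURING and SIMPLIFIED patterns are one possible way of applying the answer and its side notes, not the only one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NonCapturingGroupDemo {
    // Pattern from the question: every "(" opens a numbered group,
    // so the two nested alternations become groups 3 and 6.
    static final String ORIGINAL =
        "(((美国|PCT|加拿大){0,1})([\\u4E00-\\u9FA5]{1,8})((NO.|NOS.){1})([\\d]{5,}))";

    // Same pattern with the nested alternations written as (?:...),
    // so they still group the alternatives but get no group number.
    static final String NON_CAPTURING =
        "(((?:美国|PCT|加拿大){0,1})([\\u4E00-\\u9FA5]{1,8})((?:NO.|NOS.){1})([\\d]{5,}))";

    // Side notes applied as well: "?" instead of {0,1}, {1} dropped,
    // and a plain digit class instead of a one-element bracket expression.
    // With {1} gone, the NO./NOS. part needs only one capturing group.
    static final String SIMPLIFIED =
        "(((?:美国|PCT|加拿大)?)([\\u4E00-\\u9FA5]{1,8})(NO.|NOS.)(\\d{5,}))";

    // Returns groups 1..groupCount() of the first match, or an empty list.
    static List<String> groups(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(input);
        List<String> out = new ArrayList<>();
        if (m.find()) {
            for (int i = 1; i <= m.groupCount(); i++) {
                out.add(m.group(i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String str = "美国临时申请No.62004615";
        for (String regex : new String[] {ORIGINAL, NON_CAPTURING, SIMPLIFIED}) {
            List<String> g = groups(regex, str);
            System.out.println(g.size() + " groups: " + g);
        }
    }
}
```

With the question's input, the original pattern should report seven groups, including the duplicated "美国" and "No.". The other two patterns should report the five groups listed as the ideal result.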