java regex backreference capture-group recursive-regex

Recursive group capturing regex with backreference in Java


I am trying to capture multiple groups repeatedly in a string, using a backreference to a group within the regex. Even though I am using a Pattern, a Matcher, and a "while(matcher.find())" loop, it captures only the last instance instead of all of them. In my case the only possible tags are <sm>, <po>, <pof>, <pos>, <poi>, <pol>, <poif>, <poil>. Since these are formatting tags, I need to capture:

  1. any text outside of a tag, so that I can format it as "normal" text. I do this by capturing any text before a tag in one group and the tag itself in another group; as I iterate through the occurrences, I remove everything that has been captured from the original String, and any text left over at the end is formatted as "normal" text
  2. the "name" of the tag, so that I know how to format the text inside the tag
  3. the text contents of the tag, which will be formatted according to the tag name and its associated rules

Here is my sample code:

        String currentText = "the man said:<pof>“This one, at last, is bone of my bones</pof><poi>and flesh of my flesh;</poi><po>This one shall be called ‘woman,’</po><poil>for out of man this one has been taken.”</poil>";
        String remainingText = currentText;

        //first check if our string even has any kind of xml tag, because if not we will just format the whole string as "normal" text
        if(currentText.matches("(?su).*<[/]{0,1}(?:sm|po)[f|l|s|i|3]{0,1}[f|l]{0,1}>.*"))
        {                
            //an opening or closing tag has been found, so let us start our pattern captures
            //I am using a backreference \\2 to make sure the closing tag is the same as the opening tag
            Pattern pattern1 = Pattern.compile("(.*)<((sm|po)[f|l|s|i|3]{0,1}[f|l]{0,1})>(.*?)</\\2>",Pattern.UNICODE_CHARACTER_CLASS);
            Matcher matcher1 = pattern1.matcher(currentText);                
            int iteration = 0;
            while(matcher1.find()){
                System.out.print("Iteration ");
                System.out.println(++iteration);
                System.out.println("group1:"+matcher1.group(1));
                System.out.println("group2:"+matcher1.group(2));
                System.out.println("group3:"+matcher1.group(3));
                System.out.println("group4:"+matcher1.group(4));

                if(matcher1.group(1) != null && matcher1.group(1).isEmpty() == false)
                {
                    m_xText.insertString(xTextRange, matcher1.group(1), false);
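                    //replaceFirst treats its first argument as a regex, so quote the captured text to match it literally
                    remainingText = remainingText.replaceFirst(Pattern.quote(matcher1.group(1)), "");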
                }
                if(matcher1.group(4) != null && matcher1.group(4).isEmpty() == false)
                {
                    switch (matcher1.group(2)) {
                        case "pof": [...]
                        case "pos": [...]
                        case "poif": [...]
                        case "po": [...]
                        case "poi": [...]
                        case "pol": [...]
                        case "poil": [...]
                        case "sm": [...]
                    }
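                    //quote the whole tagged span so that characters such as '.' or '?' in the text are matched literally
                    remainingText = remainingText.replaceFirst(Pattern.quote("<"+matcher1.group(2)+">"+matcher1.group(4)+"</"+matcher1.group(2)+">"), "");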
                }
            }
        }

The loop body runs only once, and the System.out.println calls print these results in my console:

Iteration 1:
  group1:the man said:<pof>“This one, at last, is bone of my bones</pof><poi>and flesh of my flesh;</poi><po>This one shall be called ‘woman,’</po>
  group2:poil
  group3:po
  group4:for out of man this one has been taken.”

Group 3 can be ignored; the only useful groups are 1, 2 and 4 (group 3 is part of group 2). Why does this capture only the last tag instance, "poil", and not the preceding "pof", "poi", and "po" tags?

The output I would like to see is this:

Iteration 1:
  group1:the man said:
  group2:pof
  group3:po
  group4:“This one, at last, is bone of my bones

Iteration 2:
  group1:
  group2:poi
  group3:po
  group4:and flesh of my flesh;

Iteration 3:
  group1:
  group2:po
  group3:po
  group4:This one shall be called ‘woman,’

Iteration 4:
  group1:
  group2:poil
  group3:po
  group4:for out of man this one has been taken.”

Solution

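  • I found the answer to this problem: the first capture group needed a non-greedy (lazy) quantifier, just like the one the fourth capture group already had. With the greedy (.*), the first group consumes as much of the string as it can and backtracks only far enough to reach the last opening tag, so a single match swallows all the earlier tags into group 1. With (.*?), the first group stops at the first opening tag, and each call to find() returns the next tag in turn. This works exactly as needed: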

    Pattern pattern1 = Pattern.compile("(.*?)<((sm|po)[f|l|s|i|3]{0,1}[f|l]{0,1})>(.*?)</\\2>",Pattern.UNICODE_CHARACTER_CLASS);
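To see the difference in isolation, here is a minimal, self-contained sketch that runs both patterns on a shortened version of the sample text. The class name TagSplitDemo, the helper methods split and countMatches, and the "normal" label are my own, chosen for illustration; they are not part of the original code. Instead of stripping captured text out of remainingText with replaceFirst, the sketch tracks matcher.end() to find any trailing untagged text, which is one possible alternative and not a requirement of the fix.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagSplitDemo {
    // The original pattern: greedy first group
    static final Pattern GREEDY = Pattern.compile(
        "(.*)<((sm|po)[f|l|s|i|3]{0,1}[f|l]{0,1})>(.*?)</\\2>",
        Pattern.UNICODE_CHARACTER_CLASS);

    // The fixed pattern: lazy first group
    static final Pattern LAZY = Pattern.compile(
        "(.*?)<((sm|po)[f|l|s|i|3]{0,1}[f|l]{0,1})>(.*?)</\\2>",
        Pattern.UNICODE_CHARACTER_CLASS);

    // Counts how many times find() succeeds for the given pattern
    static int countMatches(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Splits the input into {label, text} pairs, where label is either
    // a tag name or the (hypothetical) label "normal" for untagged text
    static List<String[]> split(String input) {
        List<String[]> segments = new ArrayList<>();
        Matcher m = LAZY.matcher(input);
        int consumed = 0; // index just past the last matched closing tag
        while (m.find()) {
            if (!m.group(1).isEmpty()) {
                segments.add(new String[] {"normal", m.group(1)});
            }
            segments.add(new String[] {m.group(2), m.group(4)});
            consumed = m.end();
        }
        // Anything after the last closing tag is untagged text
        if (consumed < input.length()) {
            segments.add(new String[] {"normal", input.substring(consumed)});
        }
        return segments;
    }

    public static void main(String[] args) {
        String text = "the man said:<pof>bone of my bones</pof>"
            + "<poi>and flesh of my flesh;</poi><po>shall be called</po>"
            + "<poil>has been taken.</poil>";
        System.out.println("greedy matches: " + countMatches(GREEDY, text));
        System.out.println("lazy matches:   " + countMatches(LAZY, text));
        for (String[] segment : split(text)) {
            System.out.println(segment[0] + " -> " + segment[1]);
        }
    }
}
```

On this input the greedy pattern matches once and the lazy pattern matches four times, one per tag. As a side note, inside a character class the | character is a literal and not an alternation operator, so [f|l|s|i|3] also accepts a literal pipe; [flsi3] would express the intended set more precisely.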