
Groovy Regex: String Split pattern not returning same result as Matcher pattern


I'm trying to extract the data between start and end markers in a string. There are multiple matches, and I need to extract all of them (into an array or a list, it doesn't matter).

I have a limitation on my setup and cannot use the regex Matcher, so as an alternative I'm looking at using string.split() with a regex.

def str = "USELESS STUFF START:M A:STUFF1 B:MORE2 C:THAT3 END:M START:M A:STUFF4 B:MORE5 C:THAT6 END:M START:M A:STUFF7 B:MORE8 C:THAT9 END:M USELESS STUFF"

This pattern works with Regex Matcher and extracts all the matches between the starting and ending marker.

def items = str =~ /(?s)(?<=START:M).*?(?=END:M)/

Result:

[ A:STUFF1 B:MORE2 C:THAT3, A:STUFF4 B:MORE5 C:THAT6, A:STUFF7 B:MORE8 C:THAT9 ]

However, when I use the same pattern with string.split:

def items = str.split(/(?s)(?<=START:M).*?(?=END:M)/)

it returns the end and start markers themselves for each match instead of what's between them.

[USELESS STUFF START:M, END:M START:M, END:M START:M, END:M USELESS STUFF]

What am I missing? Why isn't the split pattern returning the same groups as the Matcher pattern?


Solution

  • This behavior corresponds well to the method names:

    • match what text?
    • split by what separator?

    What Groovy does in this case is essentially pour some syntactic sugar over the standard Java APIs. The line def items = str =~ /(?s)(?<=START:M).*?(?=END:M)/ is the same as

    Matcher items = Pattern.compile("(?s)(?<=START:M).*?(?=END:M)").matcher(str);
    

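    The matches found by this matcher will be (each with a leading and a trailing space, because the lookarounds exclude only the markers themselves)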

     A:STUFF1 B:MORE2 C:THAT3 
     A:STUFF4 B:MORE5 C:THAT6 
     A:STUFF7 B:MORE8 C:THAT9
    

    While the Matcher returns the matches, split does the opposite: it treats every part of the text that the regex matches as a separator, cuts it out and returns what's left:

    USELESS STUFF START:M
    //  A:STUFF1 B:MORE2 C:THAT3 is cut out since it's a separator
    END:M START:M
    //  A:STUFF4 B:MORE5 C:THAT6 is a separator
    END:M START:M
    //  A:STUFF7 B:MORE8 C:THAT9 is a separator
    END:M USELESS STUFF
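
To see both behaviors side by side, here is a minimal Java sketch that runs the pattern from the question through a Matcher and through split. The class name SplitVsMatch and its helper methods are made up for illustration; Pattern, Matcher and Pattern.split are the standard java.util.regex API that Groovy uses underneath.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: the name and the helper methods are illustrative only.
final class SplitVsMatch {
    // The sample string from the question.
    static final String INPUT = "USELESS STUFF START:M A:STUFF1 B:MORE2 C:THAT3 END:M "
            + "START:M A:STUFF4 B:MORE5 C:THAT6 END:M "
            + "START:M A:STUFF7 B:MORE8 C:THAT9 END:M USELESS STUFF";

    // The pattern from the question: text preceded by START:M and followed by END:M.
    static final Pattern BETWEEN = Pattern.compile("(?s)(?<=START:M).*?(?=END:M)");

    // Matcher: collects the text that the pattern matches.
    static List<String> matches(String s) {
        List<String> out = new ArrayList<>();
        Matcher m = BETWEEN.matcher(s);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    // split: the matched text is treated as the separator and discarded,
    // so only the text around the matches is returned.
    static List<String> leftovers(String s) {
        return Arrays.asList(BETWEEN.split(s));
    }

    public static void main(String[] args) {
        System.out.println(matches(INPUT));
        System.out.println(leftovers(INPUT));
    }
}
```

With the sample string, matches returns the three data items (each still carrying its surrounding spaces), while leftovers returns the four marker pieces shown in the question.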
    

    To actually get the data between the START and END marks, split by the markers instead: str.split(" END:M START:M | START:M | END:M ") would do. The first and last elements of that result are still the useless stuff around the markers, though. The standard String methods indexOf, lastIndexOf and substring are very helpful for getting rid of it: remove all content before the first START:M and after the last END:M, then split what remains:

    str.substring(str.indexOf("START:M ") + 8, str.lastIndexOf(" END:M"))
       .split(" END:M START:M ")
    
    // or more groovy
    str[str.indexOf("START:M ") + 8 .. str.lastIndexOf(" END:M") - 1]
       .split(" END:M START:M ")
    

    (8 is the length of "START:M ", including the trailing space)
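
The same idea as a self-contained Java sketch that does not use Matcher. The class BetweenMarkers and its extract method are hypothetical names. It assumes, as in the sample string, that the markers are separated from the data by single spaces and that every END:M except the last is directly followed by " START:M ".

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: class and method names are illustrative only.
final class BetweenMarkers {
    static final String START = "START:M "; // start marker plus trailing space (8 chars)
    static final String END = " END:M";     // end marker plus leading space

    static List<String> extract(String s) {
        int from = s.indexOf(START);
        int to = s.lastIndexOf(END);
        // Give up if the string holds no complete START ... END pair.
        if (from < 0 || to < from + START.length()) {
            return Collections.emptyList();
        }
        // Drop everything before the first START marker and after the last END marker.
        String body = s.substring(from + START.length(), to);
        // Pattern.quote makes split treat the separator as literal text, not as a regex.
        return Arrays.asList(body.split(Pattern.quote(END + " " + START)));
    }
}
```

For the string in the question, extract returns the three items without surrounding spaces. If other text can appear between an END:M and the next START:M, the separator has to be widened accordingly.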