Regex matcher returns multiple values


I'm extracting a substring from a filename with the format xxxxx_ID.extension.

Examples of strings that should match:

aaaa_bbbb_ID1.txt
xxxxxx_yy_ID2.xml
xxxx_ID3.zzz

I need the ID part. I tried:

def fileMatch = ("aaaa_bbbb_ID1.txt" =~ /(?<=_)([^_]+)(?=\.\w+$)/);
assert fileMatch.size() > 0
println fileMatch[0]

Where:

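  • (?<=_) is a lookbehind that requires an underscore immediately before the ID
  • ([^_]+) matches the ID to be extracted (a string with no underscore inside)
  • (?=\.\w+$) is a lookahead that requires the extension to follow, up to the end of the string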

It returns [ID1, ID1]. I was expecting just one result. Why does it match the ID twice?

I know I could extract the first match with fileMatch[0][0] but I'm wondering if I'm doing anything wrong.

I also tried (?<=_)([^_]+)(?=\.[^.]+$) with the same result.


Solution

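  • When you match with the =~ operator in Groovy, fileMatch[0] returns the whole match as a String if the pattern has no capturing groups. If the pattern does have capturing groups, it returns a list holding the whole match followed by each captured substring. Your pattern has one capturing group that spans the entire match, so you get [ID1, ID1]: the whole match, then group 1. The ID is not matched twice.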

    If you remove the capturing parentheses (([^_]+) => [^_]+), the pattern becomes

    /(?<=_)[^_]+(?=\.\w+$)/
    

    With a pattern that has no capturing groups, fileMatch[0] returns the whole match as a plain string (ID1).

    fileMatch.size() returns the number of matches found in the input, not the number of groups, so your assert only checks that at least one match exists. When the pattern has capturing groups, index into a match to reach them: fileMatch[0][0] is the whole match, fileMatch[0][1] is the first captured group, and so on.

    Note that the list returned for each match has one more element than the number of capturing groups in the pattern, because element 0 holds the entire match.
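
To make the difference concrete, here is a minimal sketch that runs both patterns from the question against its sample filenames. The helper name extractId is made up for illustration; everything else uses the =~ operator plus the standard java.util.regex.Matcher methods find() and group().

```groovy
// Sample filenames taken from the question.
def samples = ['aaaa_bbbb_ID1.txt', 'xxxxxx_yy_ID2.xml', 'xxxx_ID3.zzz']

// With a capturing group: each match is a list [whole match, group 1].
def grouped = ('aaaa_bbbb_ID1.txt' =~ /(?<=_)([^_]+)(?=\.\w+$)/)
assert grouped.size() == 1             // number of matches, not groups
assert grouped[0] == ['ID1', 'ID1']    // whole match, then group 1
assert grouped[0][1] == 'ID1'          // group 1 only

// Without a capturing group: each match is a plain String.
def plain = ('aaaa_bbbb_ID1.txt' =~ /(?<=_)[^_]+(?=\.\w+$)/)
assert plain[0] == 'ID1'

// Hypothetical helper: returns the ID, or null when the name does not match.
def extractId = { String name ->
    def m = (name =~ /(?<=_)[^_]+(?=\.\w+$)/)
    m.find() ? m.group() : null
}

def ids = samples.collect { extractId(it) }
println ids   // prints [ID1, ID2, ID3]
```

Calling find() and group() directly avoids the question of whether the indexed result is a String or a list, which can be convenient if the pattern may gain capturing groups later.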