Regex - Ignore part of the string


I am working with Pentaho, which uses the Java regex package java.util.regex.

I want to extract several values from each line of a text file, from both the start and the end of the line:

^StartofString Controls\(param1="(D[0-9]{0,})",param2="(G[0-9]{0,})",param3="([^"]{0,})",param4="([^"]{0,})"\):(?:.*)param5="([^"]{0,})",.*

There is a long part of the string that I want to ignore, and I try to do so with (?:.*).

The non-capturing group seems to work when I test the regex in the step, but it does not work when I execute the transformation. I test the string in a 'Regex Evaluation' step, check the boolean result of that step with 'Filter rows', and extract the groups in a JavaScript step:

var pattern = Packages.java.util.regex.Pattern.compile(patternStr);
var matcher = pattern.matcher(content.toString());
var matchFound = matcher.find();

with patternStr being the same regex as the one in the 'Regex Evaluation' step, but with escape characters (\) added.

I have read many questions about ignoring parts of strings in regex and still can't find the answer. Any help is welcome. I can provide more information if needed.


Solution

  • A non-capturing group doesn't mean that its content is left out of the match; it means that the content isn't stored in a numbered group (you're still grouping tokens in your regex, which is useful for applying a quantifier to all of them at once).

    For example, these regexes all match exactly the same string, abc:

    abc
    a(?:b)c
    a(b)c
    

    However, in the third case you've defined a capturing group, which lets you access b independently. The first two cases are equivalent in all respects.

    The non-capturing group becomes useful when you want to apply a quantifier to a group of tokens without creating an extra group you can reference later. The following regexes match the same strings:
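
    To check this for yourself, here is a minimal Java sketch that compiles the three patterns and reports how many capturing groups each one defines. The class name GroupCountDemo and its helper methods are made up for this illustration; only Pattern and Matcher come from java.util.regex.

    ```java
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class GroupCountDemo {
        // Number of capturing groups the regex defines (group 0, the whole match, is not counted).
        static int groupCount(String regex) {
            return Pattern.compile(regex).matcher("").groupCount();
        }

        // True when the regex matches the entire input.
        static boolean matchesAll(String regex, String input) {
            return Pattern.matches(regex, input);
        }

        public static void main(String[] args) {
            for (String regex : new String[] {"abc", "a(?:b)c", "a(b)c"}) {
                System.out.println(regex + " matches abc: " + matchesAll(regex, "abc")
                        + ", capturing groups: " + groupCount(regex));
            }
            // Only the third pattern exposes b on its own.
            Matcher m = Pattern.compile("a(b)c").matcher("abc");
            if (m.matches()) {
                System.out.println("group(1) = " + m.group(1));
            }
        }
    }
    ```

    In all three cases group(0), the whole match, is abc; the only difference is whether group(1) exists.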

    (ab)*(c)\2
    (?:ab)*(c)\1
    

    We want to apply * to the ab tokens. Either we do it with a capturing group (first example), which creates a group we can reference, or we use a non-capturing group (second example). The backreference at the end of the regex is meant to match the c; in the first example c is the second group, since ab is the first one, while in the second example c is the first group that can be referenced.

    Now that I've explained what non-capturing groups do, let's solve your problem: you want to drop something from the middle of your string, where you know what's at the beginning and what's at the end.

    Let's assume the string you want to match is the following:
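
    The group numbering can be verified with a short Java sketch. The class name BackreferenceDemo and the test inputs are invented for this example; note that the regex backslash has to be doubled inside a Java string literal.

    ```java
    import java.util.regex.Pattern;

    public class BackreferenceDemo {
        // ab is group 1, so c is group 2 and the backreference is \2.
        static final Pattern WITH_CAPTURE = Pattern.compile("(ab)*(c)\\2");
        // ab is not numbered, so c is group 1 and the backreference is \1.
        static final Pattern WITHOUT_CAPTURE = Pattern.compile("(?:ab)*(c)\\1");

        // True when the pattern matches the entire input.
        static boolean matches(Pattern pattern, String input) {
            return pattern.matcher(input).matches();
        }

        public static void main(String[] args) {
            for (String input : new String[] {"ababcc", "cc", "ababc"}) {
                System.out.println(input
                        + " -> capturing: " + matches(WITH_CAPTURE, input)
                        + ", non-capturing: " + matches(WITHOUT_CAPTURE, input));
            }
        }
    }
    ```

    Both patterns should agree on every input, since they differ only in how the groups are numbered.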

    Aremove-thisB
    

    And that you want the result AB.

    There are multiple strategies to do so; the easiest in your case is probably to match both the beginning and the end of the string in their own capturing groups and build your output from there:

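    // Capture the start (A) and the end (B); .* consumes the part to ignore.
    var pattern = Packages.java.util.regex.Pattern.compile("(A).*(B)");
    var matcher = pattern.matcher(content.toString());
    var matchFound = matcher.find();
    // Build the output from the two captured ends only.
    var result = matchFound ? matcher.group(1) + matcher.group(2) : null;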
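
    The same idea applied to the regex from your question could look like the Java sketch below. It assumes that (?:.*) can be written as a plain .* (as explained above, the two behave the same), and the sample line is invented for illustration since the question doesn't show real input. The class name IgnoreMiddleDemo and the extract helper are hypothetical.

    ```java
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class IgnoreMiddleDemo {
        // The regex from the question as a Java string literal:
        // each regex backslash is doubled and each double quote is escaped.
        static final Pattern LINE = Pattern.compile(
                "^StartofString Controls\\(param1=\"(D[0-9]{0,})\",param2=\"(G[0-9]{0,})\","
                + "param3=\"([^\"]{0,})\",param4=\"([^\"]{0,})\"\\):"
                + ".*param5=\"([^\"]{0,})\",.*");

        // Returns the five captured values, or null when the line does not match.
        static String[] extract(String line) {
            Matcher m = LINE.matcher(line);
            if (!m.find()) {
                return null;
            }
            String[] values = new String[m.groupCount()];
            for (int i = 1; i <= m.groupCount(); i++) {
                values[i - 1] = m.group(i);
            }
            return values;
        }

        public static void main(String[] args) {
            // Hypothetical sample line, not taken from the question.
            String line = "StartofString Controls(param1=\"D12\",param2=\"G34\",param3=\"x\",param4=\"y\")"
                    + ":long text to ignore,param5=\"z\",rest";
            System.out.println(String.join(" | ", extract(line)));
        }
    }
    ```

    Keep in mind that .* is greedy: if param5 appears more than once in a line, the group captures the last occurrence.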