Regex with multiple groups, some of which are optional


I'm having trouble writing a regex that matches multiple groups, some of which are optional. I've tried greedy and non-greedy variations, but can't get it to work.

As input, I have cells which look like this:

SEPA Overboeking                 IBAN: AB1234        BIC: LALA678                    Naam: John Smith            Omschrijving: Hello hello        Kenmerk: 03-05-2019 23:12 533238

I want to split these into the groups IBAN, BIC, Naam, Omschrijving and Kenmerk.

For this example, the result should be: AB1234; LALA678; John Smith; Hello hello; 03-05-2019 23:12 533238. To obtain this, I've used:

.*IBAN: (.*)\s+BIC: (.*)\s+Naam: (.*)\s+Omschrijving: (.*)\s+Kenmerk: (.*)

This works perfectly as long as all of these groups are present in the input. Some cells, however, don't have the "Omschrijving" and/or "Kenmerk" part. In that case I would like the corresponding groups to be empty in the output. Right now, nothing is matched at all.

Help would be greatly appreciated!

N.B.: I'm working in KNIME (open source data analysis tool)


Solution

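  • I was able to split your input using the following regular expression (written across several lines for readability; unless your regex engine runs in free-spacing/comments mode, join it into a single line):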

    ^.*
    \s+IBAN\:\s*(?<IBAN>.*?)
    \s+BIC\:\s*(?<BIC>.*?)
    \s+Naam\:\s*(?<Naam>.*?)
    (?:\s+Omschrijving\:\s*(?<Omschrijving>.*?))?
    (?:\s+Kenmerk\:\s*(?<Kenmerk>.*?))?
    $
    

    This requires the fields to appear in the given order and treats IBAN, BIC and Naam as required, while Omschrijving and Kenmerk are optional. I am pretty sure this can still be optimized, but it produces the following output, which should be fine for you (or at least a starting point):

    Example output results

    For evaluation and testing in KNIME, I used Palladian's Regex Extractor node, which provides a nice preview and can be configured as follows:

    Regex Extractor configuration

    I added an example workflow to my NodePit Space. It contains some example lines, parses them and produces the output shown above.
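
To check the expression from the answer outside KNIME, here is a minimal Python sketch. It is a port, not the setup used in the answer. Python's `re` module writes named groups as `(?P<name>...)` rather than `(?<name>...)`, and the `re.VERBOSE` flag is used so the pattern can stay on several lines. The helper name `split_fields` and the shortened sample cells are made up for illustration. Groups that did not take part in the match are turned into empty strings, which is what the question asks for.

```python
import re

# Python port of the answer's pattern (assumption: (?P<name>...) replaces
# the (?<name>...) syntax). re.VERBOSE makes the engine ignore the
# line breaks and indentation inside the pattern.
PATTERN = re.compile(
    r"""
    ^.*
    \s+IBAN:\s*(?P<IBAN>.*?)
    \s+BIC:\s*(?P<BIC>.*?)
    \s+Naam:\s*(?P<Naam>.*?)
    (?:\s+Omschrijving:\s*(?P<Omschrijving>.*?))?
    (?:\s+Kenmerk:\s*(?P<Kenmerk>.*?))?
    $
    """,
    re.VERBOSE,
)


def split_fields(cell):
    """Return a dict of the five fields, or None if the cell does not match.

    Hypothetical helper: optional groups that are absent come back as None
    from groupdict(), so they are replaced by empty strings here.
    """
    match = PATTERN.match(cell)
    if match is None:
        return None
    return {name: (value or "") for name, value in match.groupdict().items()}


# Shortened sample cells (illustrative, modelled on the question's input).
complete = (
    "SEPA Overboeking   IBAN: AB1234   BIC: LALA678   Naam: John Smith   "
    "Omschrijving: Hello hello   Kenmerk: 03-05-2019 23:12 533238"
)
without_optional = "SEPA Overboeking   IBAN: AB1234   BIC: LALA678   Naam: John Smith"

if __name__ == "__main__":
    print(split_fields(complete))
    print(split_fields(without_optional))
```

For the second sample cell, `Omschrijving` and `Kenmerk` come back as empty strings while the three required fields are still filled. A cell that lacks IBAN, BIC or Naam does not match at all, so `split_fields` returns `None` for it.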