regex knime

Regex to remove duplicate numbers from a string


I have produced a data set with codes separated by pipe symbols, and I realized there are many duplicates in each row. Here are four example rows (the regex is applied to each row individually in KNIME):

0612|0613|061|0612|0612
0211|0612|021|0212|0211|0211
0111|0111
0511|0512|0511|0511|0521|0512|0511

I am trying to build a regex that removes the duplicate codes from each row. I tested \b(\d+)\b.*\b\1\b from a different thread here, but that expression does not keep the other codes. The desired outputs for the example rows above would be:

0612|0613|061
0211|0612|021|0212
0111
0511|0512|0521

Appreciate your help


Solution

  • I don't know which regex engine KNIME uses.

    To do it in one pass you probably need an engine that supports variable-length lookbehind, e.g. .NET:

    \|(\d+)\b(?<=\b\1\b.*?\1)
    

    See .NET regex demo at Regexstorm (check [•] replace matches with, click on "context")

    Update: It turns out KNIME uses Java's Pattern implementation.

    Java regex does implement variable-width lookbehind, but only with finite repetition (bounded quantifiers). The second issue is that the backreference \1 can't be used directly inside a lookbehind. So we need some trickery: put the backreference into a lookahead, and put that lookahead inside the lookbehind.

    Let's assume a maximum distance of 999 characters between duplicates and that each field contains at most 9 digits (adjust these values to your needs).

    \|(\d+)\b(?<=\b(?=\|?\1\b).{1,999}?\|\d{1,9})
    

    Java regex demo at Regex101 (explanation on right side)
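
    Outside the demo, the same pattern could be applied in plain Java roughly as follows. This is only a minimal sketch: the class name DedupKeepFirst and the method dedupe are made up for illustration, and how the expression is wired into a KNIME node is not covered here. Note that every backslash has to be doubled in a Java string literal.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of KNIME or any library.
class DedupKeepFirst {
    // Matches "|code" when the same code already occurs earlier in the row.
    // 999 and 9 are the assumed limits from the answer: maximum distance
    // between duplicates and maximum number of digits per field.
    private static final Pattern LATER_DUPLICATE = Pattern.compile(
            "\\|(\\d+)\\b(?<=\\b(?=\\|?\\1\\b).{1,999}?\\|\\d{1,9})");

    // Removes every later duplicate, so the first occurrence of each code stays.
    static String dedupe(String row) {
        return LATER_DUPLICATE.matcher(row).replaceAll("");
    }

    public static void main(String[] args) {
        String[] rows = {
            "0612|0613|061|0612|0612",
            "0211|0612|021|0212|0211|0211",
            "0111|0111",
            "0511|0512|0511|0511|0521|0512|0511"
        };
        for (String row : rows) {
            System.out.println(dedupe(row));
        }
    }
}
```

    The pattern is compiled once and reused, because String.replaceAll would recompile it for every row.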

    0612|0613|061
    0211|0612|021|0212
    0111
    0511|0512|0521


    With only a lookahead you can get unique codes per row too, but the other way round: it keeps the last occurrence of each code instead of the first, so the order differs from your desired results.

    \b(\d+)\|(?=.*?\b\1\b)
    

    Another demo on Regex101

    0613|061|0612
    0612|021|0212|0211
    0111
    0521|0512|0511
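
    The lookahead-only variant needs no bounded quantifiers, so in Java it might be used like this. Again a sketch under assumptions: DedupKeepLast and dedupe are hypothetical names chosen for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of KNIME or any library.
class DedupKeepLast {
    // Matches "code|" when the same code occurs again later in the row.
    private static final Pattern EARLIER_DUPLICATE =
            Pattern.compile("\\b(\\d+)\\|(?=.*?\\b\\1\\b)");

    // Removes every earlier duplicate, so the last occurrence of each code stays.
    static String dedupe(String row) {
        return EARLIER_DUPLICATE.matcher(row).replaceAll("");
    }

    public static void main(String[] args) {
        String[] rows = {
            "0612|0613|061|0612|0612",
            "0211|0612|021|0212|0211|0211",
            "0111|0111",
            "0511|0512|0511|0511|0521|0512|0511"
        };
        for (String row : rows) {
            System.out.println(dedupe(row));
        }
    }
}
```

    Because each match is decided by what follows it, the surviving codes appear in the order of their last occurrence, which is why the output order differs from the lookbehind version.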


    For further information, have a look at the Stack Overflow Regex FAQ.