I have produced a data set with codes separated by pipe symbols, and I realized there are many duplicates in each row. Here are four example rows (the regex is applied to each row individually in KNIME):
0612|0613|061|0612|0612
0211|0612|021|0212|0211|0211
0111|0111
0511|0512|0511|0511|0521|0512|0511
I am trying to build a regex that removes the duplicate code numbers from each row.
I tested \b(\d+)\b.*\b\1\b
from a different thread here but the expression does not keep the other codes. The desired outputs for the example rows above would be
0612|0613|061
0211|0612|021|0212
0111
0511|0512|0521
Appreciate your help
I have no idea which regex engine KNIME uses.
To do it in one pass you probably need an engine that supports variable-length lookbehind, e.g. .NET:
\|(\d+)\b(?<=\b\1\b.*?\1)
See .NET regex demo at Regexstorm (check [•] replace matches with, click on "context")
Update: It turns out KNIME uses Java's Pattern implementation.
In Java regex, variable-width lookbehind is actually implemented, but only with finite (bounded) repetition. The second issue is that the backreference \1 can't be used directly inside a lookbehind. So we need some trickery: put the backreference into a lookahead, and put that lookahead inside the lookbehind.
Let's assume a maximum potential distance of 999 characters between duplicates and that each field contains at most 9 digits (adjust these values to your needs):
\|(\d+)\b(?<=\b(?=\|?\1\b).{1,999}?\|\d{1,9})
See the Java regex demo at Regex101 (explanation on the right side). Replacing the matches with an empty string yields:
0612|0613|061
0211|0612|021|0212
0111
0511|0512|0521
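To try this outside KNIME, the replacement can be sketched in plain Java. The class and method names here are made up for illustration, and the 999/9 limits are the assumed values from above. It is only a minimal sketch of applying the pattern with an empty replacement, not necessarily how KNIME runs it internally.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class LookbehindDedup {
    // The lookbehind pattern from above; backslashes are doubled for the Java string literal.
    // The limits (999 characters distance, 9 digits per code) are assumptions, adjust as needed.
    private static final Pattern LATER_DUPLICATE = Pattern.compile(
            "\\|(\\d+)\\b(?<=\\b(?=\\|?\\1\\b).{1,999}?\\|\\d{1,9})");

    // Removes every "|code" whose code already occurred earlier in the same row,
    // so the first occurrence of each code is the one that survives.
    static String keepFirst(String row) {
        return LATER_DUPLICATE.matcher(row).replaceAll("");
    }

    public static void main(String[] args) {
        String[] rows = {
            "0612|0613|061|0612|0612",
            "0211|0612|021|0212|0211|0211",
            "0111|0111",
            "0511|0512|0511|0511|0521|0512|0511"
        };
        for (String row : rows) {
            System.out.println(keepFirst(row));
        }
    }
}
```

The trailing \|\d{1,9} inside the lookbehind has to end exactly at the current match, which is what stops a code from counting itself as its own earlier duplicate.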
With only a lookahead you can get unique codes per row too, but the other way round: the last occurrence of each code is kept instead of the first (so not like your desired results).
\b(\d+)\|(?=.*?\b\1\b)
0613|061|0612
0612|021|0212|0211
0111
0521|0512|0511
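The lookahead variant works with plain String.replaceAll in Java, since it needs no lookbehind at all. Again a minimal sketch with a made-up class and method name, shown only to make the "last occurrence wins" behaviour easy to reproduce:

```java
// Hypothetical helper class, for illustration only.
final class LookaheadDedup {
    // Removes every "code|" that is followed later in the row by the same code,
    // so the last occurrence of each code is the one that survives.
    static String keepLast(String row) {
        return row.replaceAll("\\b(\\d+)\\|(?=.*?\\b\\1\\b)", "");
    }

    public static void main(String[] args) {
        String[] rows = {
            "0612|0613|061|0612|0612",
            "0211|0612|021|0212|0211|0211",
            "0111|0111",
            "0511|0512|0511|0511|0521|0512|0511"
        };
        for (String row : rows) {
            System.out.println(keepLast(row));
        }
    }
}
```

If the original order matters, this version is not a drop-in substitute. It trades the first-occurrence order for a simpler pattern without length limits.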
For further information, have a look at the Stack Overflow Regex FAQ.