Search code examples
regexunique-values

What regular expression can remove duplicate items from a string?


Given a string of identifiers separated by :, is it possible to construct a regular expression to extract the unique identifiers into another string, also separated by :?

How is it possible to achieve this using a regular expression? I have tried s/(:[^:])(.*)\1/$1$2/g with no luck, because the (.*) is greedy and skips to the last match of $1.

Example: a:b:c:d:c:c:x:c:c:e:e:f should give a:b:c:d:x:e:f

Note: I am coding in perl, but I would very much appreciate using a regex for this.


Solution

  • In .NET which supports infinite repetition inside lookbehind, you could search for

    (?<=\b\1:.*)\b(\w+):?
    

    and replace all matches with the empty string.

    Perl (at least Perl 5) only supports fixed-length lookbehinds, so you can try the following (using lookahead, with a subtly different result):

    \b(\w+):(?=.*\b\1:?)
    

    If you replace that with the empty string, all previous repetitions of a duplicate entry will be removed; the last one will remain. So instead of

    a:b:c:d:x:e:f
    

    you would get

    a:b:d:x:c:e:f
    

    If that is OK, you can use

    $subject =~ s/\b(\w+):(?=.*\b\1:?)//g;
    

    Explanation:

    First regex:

    (?<=\b\1:.*): Check if you can match the contents of backreference no. 1, followed by a colon, somewhere before in the string.

    \b(\w+):?: Match an identifier (from a word boundary to the next :), optionally followed by a colon.

    Second regex:

    \b(\w+):: Match an identifier and a colon.

    (?=.*\b\1:?): Then check whether you can match the same identifier, optionally followed by a colon, somewhere ahead in the string.