
How to remove duplicate character sequences within a string?


I have a column in a tibble that should always have an 8-character string as its value, for example ABCDEF12.

Unfortunately, I sometimes get values with a duplication of 2 characters in the string, like ABCDCDEF12.

The position of the duplicated pair is not fixed, so the value can be ABABCDEF12, ABCDEFEF12, etc.

Do you have any suggestions for reducing these strings to 8 characters by removing the duplicated sequence?

If we take the above examples as input, we should always end up with ABCDEF12 as output.

Another important constraint: I work on a computer that does not have Internet access. I have the tidyverse at my disposal, but I will not be able to install any additional packages.


Solution

  • Let's say your character vector is string:

    string <- c("ABCDEF12", "ABCDCDEF12", "ABABCDEF12", "ABCDEFEF12")
    

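    Then we can use base R: split each string into single characters with strsplit, keep only the first occurrence of each character with unique, and paste them back together with paste0, iterating over the elements with sapply. Note that unique drops every repeated character, so this only gives the right answer when the correct 8-character value never contains the same character twice. The native pipe |> and the \(x) lambda shorthand require R 4.1 or later: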

    > strsplit(string, "")|>
        sapply(\(x) paste0(unique(x), collapse = ""))
    [1] "ABCDEF12" "ABCDEF12" "ABCDEF12" "ABCDEF12"
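
If the valid 8-character values can legitimately contain a repeated character (for example AABCDE12), a regular expression with a backreference may be a safer option. The following is only a sketch under a few assumptions: the duplication is always exactly two characters repeated immediately, at most one duplication occurs per value, and any value longer than 8 characters is a corrupted one. The helper name drop_repeated_pair is hypothetical and not part of any package.

```r
string <- c("ABCDEF12", "ABCDCDEF12", "ABABCDEF12", "ABCDEFEF12")

# Hypothetical helper: remove one immediately repeated two-character block.
drop_repeated_pair <- function(x) {
  # (.{2}) captures any two characters; \\1 requires the same two
  # characters to follow directly. sub() replaces only the first match.
  fixed <- sub("(.{2})\\1", "\\1", x, perl = TRUE)
  # Assumption: only values longer than 8 characters are corrupted,
  # so values that already have the right length are left untouched.
  ifelse(nchar(x) > 8, fixed, x)
}

result <- drop_repeated_pair(string)
result
# [1] "ABCDEF12" "ABCDEF12" "ABCDEF12" "ABCDEF12"
```

This uses base R only, so nothing needs to be installed. Inside a tibble it could be applied with dplyr::mutate() on the affected column, and stringr::str_replace(x, "(.{2})\\1", "\\1") should perform the same substitution if you prefer the tidyverse functions.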