I have a column in a tibble that should always contain an 8-character string, for example ABCDEF12.
Unfortunately, I sometimes get values in which 2 characters of the string are duplicated, like ABCDCDEF12.
The position of the duplicated characters is not fixed, so it can also be ABABCDEF12, or ABCDEFEF12, etc.
Do you have any suggestions for reducing these strings to 8 characters by removing the duplicated sequence?
If we take the above examples as input, we should always end up with ABCDEF12 as output.
Another important thing to know is that I work on a computer that does not have Internet access. I have tidyverse at my disposal, but I will not be able to install any additional packages.
Let's say your character vector is string:
string <- c("ABCDEF12", "ABCDCDEF12", "ABABCDEF12", "ABCDEFEF12")
Then we can use base R strsplit + unique + paste0, iterating with sapply:
strsplit(string, "") |>
  sapply(\(x) paste0(unique(x), collapse = ""))
#> [1] "ABCDEF12" "ABCDEF12" "ABCDEF12" "ABCDEF12"
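One caveat: unique drops every repeated character, not only an adjacent duplicated pair, so a valid value that legitimately repeats a character (say AABBCC11) would be shortened to ABC1. If that can happen in your data, a regex backreference may be a safer alternative. The following is only a sketch in base R, so no extra packages are needed; the helper name remove_duplicated_pair is made up for illustration, and it assumes the duplicated 2-character sequence sits directly next to its original and that only strings longer than 8 characters need fixing:

```r
string <- c("ABCDEF12", "ABCDCDEF12", "ABABCDEF12", "ABCDEFEF12")

# Hypothetical helper (not from any package): remove one copy of the first
# immediately repeated 2-character sequence, but only in strings that are
# longer than the expected 8 characters.
remove_duplicated_pair <- function(x) {
  too_long <- !is.na(x) & nchar(x) > 8
  # "(..)\\1" matches any two characters followed directly by the same two
  # characters; the replacement "\\1" keeps a single copy of that pair.
  x[too_long] <- sub("(..)\\1", "\\1", x[too_long], perl = TRUE)
  x
}

remove_duplicated_pair(string)
#> [1] "ABCDEF12" "ABCDEF12" "ABCDEF12" "ABCDEF12"
```

Because the helper is vectorised, it should also work on a tibble column inside dplyr::mutate(). Strings that are already 8 characters long are returned unchanged, so AABBCC11 stays AABBCC11.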