r string intersection

Extract characters that repeat in the same position in a vector


How can I get the characters that repeat at the same position in every string of a vector (here, each column of a data frame)?

df <- data.frame(col1 = paste0("a", LETTERS[1:5]),
                 col2 = paste0("b", letters[1:5]),
                 col3 = paste0("ccde", letters[1:5]),
                 col4 = paste0('?', letters[1:2], 1:5),
                 col5 = paste0(1:5, 'hello you', letters[1:2], 1:5),
                 col6 = paste0('hello', letters[1:2], 1:5, "you"),
                 col7 = c("hello1 you", "hello you2", "hello3 you", "hello you4", "hello5 you"))

#   col1 col2  col3 col4         col5       col6       col7
# 1   aA   ba ccdea  ?a1 1hello youa1 helloa1you hello1 you
# 2   aB   bb ccdeb  ?b2 2hello youb2 hellob2you hello you2
# 3   aC   bc ccdec  ?a3 3hello youa3 helloa3you hello3 you
# 4   aD   bd ccded  ?b4 4hello youb4 hellob4you hello you4
# 5   aE   be ccdee  ?a5 5hello youa5 helloa5you hello5 you

result <- c("a", "b", "ccde", "?", "hello you", "helloyou", "hello")

Solution

  • Here is one possible approach: convert each string to raw bytes, compare the first string against each of the others for equality, combine the comparisons with a logical AND via Reduce(), and use the result to subset the bytes of the first string before converting them back to character. This assumes that all strings in a column have the same length. The same idea works with strsplit(), but I believe that would be slightly less efficient.

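    f <- function(x) {
      # Bytes of the first string; every other string is compared against these
      cv <- charToRaw(x[1])
      # Keep only the positions where all remaining strings match the first
      rawToChar(cv[Reduce(`&`, lapply(x[-1], \(y) cv == charToRaw(y)))])
    }
    
    sapply(df, f)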
    #  col1        col2        col3        col4        col5        col6        col7 
    #   "a"         "b"      "ccde"         "?" "hello you"  "helloyou"     "hello"
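
For comparison, the strsplit() variant mentioned above could look roughly like the sketch below. The helper name f_split is hypothetical (it is not part of the original answer), and the sketch again assumes that all strings in a vector have the same length. It compares characters rather than bytes, and it passes an initial all-TRUE mask to Reduce() so that a vector holding a single string is returned unchanged.

```r
# Sketch of the strsplit() variant; f_split is a hypothetical name.
# Assumes every string in x has the same number of characters.
f_split <- function(x) {
  x <- as.character(x)  # also covers factor columns
  # Split every string into a vector of single characters
  chars <- strsplit(x, "")
  first <- chars[[1]]
  # For each remaining string, flag the positions that match the first string
  same <- lapply(chars[-1], function(y) first == y)
  # A position is kept only if it matches in every string
  keep <- Reduce(`&`, same, rep(TRUE, length(first)))
  paste(first[keep], collapse = "")
}

# Small usage example built the same way as two columns of the question's data
df_small <- data.frame(col4 = paste0("?", letters[1:2], 1:5),
                       col6 = paste0("hello", letters[1:2], 1:5, "you"))
sapply(df_small, f_split)
#       col4       col6 
#        "?" "helloyou"
```

Whether this is actually slower than the raw-based version would need to be benchmarked on real data; the sketch is only meant to show that the logic carries over unchanged.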