r string text split match

Iterative counting of string matches across columns in R


I have a data frame with two columns of strings:

x <- data.frame(a = c("HH UH D", "L EH . M IH N", "EH K . S AE M . P EL"),
                b = c("HH UH F", "L IY . V IH NG", "S AE M . P EL"))

I am trying to count how many of the space-separated characters in column b, row 1 match the characters in column a, row 1, then do the same for column b, row 2 against column a, row 2, and so on. I then want to add this count as a new column. The output of this calculation would be something like:

x <- data.frame(a = c("HH UH D", "L EH . M IH N", "EH K . S AE M . P EL"),
                b = c("HH UH F", "L IY . V IH NG", "S AE M . P EL"), 
                c = c(2, 2, 5)) # HH and UH match, so 2 
                                # L and IH match, so 2 
                                # S, AE, M, P, and EL all match, so 5

I have tried using something like this:

library(stringr)  # provides str_split() and str_count()
a_characters <- str_split(x$a, " ")
b_characters <- str_split(x$b, " ")
stringcounting <- data.frame()

for (letter in b_characters){
  count <- str_count(a_characters, letter)
  sum_count <- sum(count)
  stringcounting <- rbind(stringcounting, sum_count)
}

But the result is 1, 50, 20 rather than 2, 2, 5, and I can't see why. I imagine something is going wrong in my for loop, and probably also in the way I've split my strings into characters, but I'm not sure what.


Solution

  • We can remove "." after splitting the strings, since we don't want to compare it, and then count the matching strings with %in% and sum.

    mapply(function(x, y) sum(x[x != "."] %in% y[y != "."]),
           a_characters, b_characters)
    #[1] 2 2 5
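
For completeness, here is a minimal self-contained sketch of the same idea that also adds the count as the new column c the question asks for. It assumes base R only, so strsplit() stands in for stringr's str_split(). The names a_tokens, b_tokens and count_matches are made up for illustration and are not part of the original answer.

```r
# Rebuild the example data from the question.
x <- data.frame(a = c("HH UH D", "L EH . M IH N", "EH K . S AE M . P EL"),
                b = c("HH UH F", "L IY . V IH NG", "S AE M . P EL"))

# Split each string on single spaces; as.character() guards against
# factor columns in R versions where stringsAsFactors defaults to TRUE.
a_tokens <- strsplit(as.character(x$a), " ", fixed = TRUE)
b_tokens <- strsplit(as.character(x$b), " ", fixed = TRUE)

# Hypothetical helper: count the tokens of `a` that also occur in `b`,
# ignoring the "." separator tokens.
count_matches <- function(a, b) sum(a[a != "."] %in% b[b != "."])

# Apply the helper row by row and store the result as column c.
x$c <- mapply(count_matches, a_tokens, b_tokens)
x$c
#[1] 2 2 5
```

Note that %in% is a membership test, so a token that appears more than once in a is counted each time it appears, as long as it occurs at least once in b.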