I have a data frame with two columns of strings:
x <- data.frame(a = c("HH UH D", "L EH . M IH N", "EH K . S AE M . P EL"),
                b = c("HH UH F", "L IY . V IH NG", "S AE M . P EL"))
I am trying to count how many of the characters (the space-separated elements) in column b, row 1 match the characters in column a, row 1, then do the same for column b, row 2 against column a, row 2, and so on. I then want to add this count as a new column. The output of this calculation would be something like:
x <- data.frame(a = c("HH UH D", "L EH . M IH N", "EH K . S AE M . P EL"),
                b = c("HH UH F", "L IY . V IH NG", "S AE M . P EL"),
                c = c(2, 2, 5)) # HH and UH match, so 2
                                # L and IH match, so 2
                                # S, AE, M, P, and EL all match, so 5
I have tried using something like this:
library(stringr)
a_characters <- str_split(x$a, " ")
b_characters <- str_split(x$b, " ")
stringcounting <- data.frame()
for (letter in b_characters){
  count <- str_count(a_characters, letter)
  sum_count <- sum(count)
  stringcounting <- rbind(stringcounting, sum_count)
}
But the result here is 1, 50, 20 rather than 2, 2, 5, and I can't see why. I imagine something is going wrong in my for loop, and likely also in the way I've split my strings into characters, but I'm not sure what.
We can remove "." after splitting the strings, since we don't want to compare it, and then count the matching elements using %in% and sum.
mapply(function(x, y) sum(x[x != "."] %in% y[y != "."]),
       a_characters, b_characters)
#[1] 2 2 5
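Since the goal was to store the count as a new column, the same idea can be wrapped up as below. This is only a minimal sketch: it uses base R's strsplit() in place of str_split() so that it runs without stringr, and count_matches is a hypothetical helper name introduced here for illustration. Note that sum(p %in% q) counts the elements of a that occur in b, and a repeated element is counted each time it appears, so it may need adjusting if duplicates should be handled differently.

```r
# Self-contained sketch in base R; strsplit() stands in for stringr::str_split()
x <- data.frame(a = c("HH UH D", "L EH . M IH N", "EH K . S AE M . P EL"),
                b = c("HH UH F", "L IY . V IH NG", "S AE M . P EL"),
                stringsAsFactors = FALSE)

# count_matches() is a hypothetical helper, not part of any package
count_matches <- function(a, b) {
  # Split each string on single spaces (fixed = TRUE: treat " " literally)
  a_tokens <- strsplit(a, " ", fixed = TRUE)
  b_tokens <- strsplit(b, " ", fixed = TRUE)
  # Row by row: drop "." and count the elements of a that also occur in b
  mapply(function(p, q) sum(p[p != "."] %in% q[q != "."]),
         a_tokens, b_tokens)
}

x$c <- count_matches(x$a, x$b)
x$c
#[1] 2 2 5
```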