I'm trying to create a new column in a data frame that holds the number of characters that match between two strings, counting from the left of both strings.
Each row has a comparison string, which we want to test against a user-given string. Given a data frame:
df <- data.frame(x=c("yhf", "rnmqjk", "wok"), y=c("yh", "rnmj", "ok"))
x y
1 yhf yh
2 rnmqjk rnmj
3 wok ok
Where x is our comparison string and y is our given string, I'm looking to have the values 2, 3, 0 output in column z, like so:
x y z
1 yhf yh 2
2 rnmqjk rnmj 3
3 wok ok 0
Essentially, I want each given string (y) checked from left to right against its comparison string (x). At the first position where the characters don't line up, the check should stop and the number of matches so far should be recorded.
Thank you in advance!
This code works for your example:
df$z <- mapply(function(x, y) which.max(x != y),
               strsplit(as.character(df$x), split=""),
               strsplit(as.character(df$y), split="")) - 1
df
x y z
1 yhf yh 2
2 rnmqjk rnmj 3
3 wok ok 0
As an outline, strsplit splits a character vector into a list of character vectors. Here, with the split="" argument, each element of those vectors is a single character. The which.max function returns the first position at which its argument attains its maximum. Since the vectors returned by x != y are logical, which.max returns the first position where a difference is observed (the first TRUE). mapply takes a function and one or more lists and applies the function to corresponding elements of the lists.
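To make the outline concrete, here is a small walk-through of the second example row. It is only an illustration: the variable names are made up, and the comparison is restricted to the first four positions so that both vectors have the same length, which differs slightly from the one-liner above.

```r
# Second row of the example: x = "rnmqjk", y = "rnmj".
x_chars <- strsplit("rnmqjk", split = "")[[1]]  # "r" "n" "m" "q" "j" "k"
y_chars <- strsplit("rnmj", split = "")[[1]]    # "r" "n" "m" "j"

# Compare position by position over the first four characters
# (an assumption made here to keep the two vectors the same length).
mismatch <- x_chars[1:4] != y_chars             # FALSE FALSE FALSE TRUE

# TRUE counts as 1 and FALSE as 0, so which.max finds the first TRUE.
first_diff <- which.max(mismatch)               # 4

# Everything before the first difference matched.
z <- first_diff - 1                             # 3
```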
Note that this produces warnings that the lengths of the strings don't match. This could be addressed in a couple of ways; the easiest is to wrap the mapply call in suppressWarnings if the messages bug you.
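Another way to deal with the warnings is to avoid the unequal-length comparison in the first place. The following is a minimal sketch of that idea, not the code from the answer above: prefix_match_len is a hypothetical helper name, and it assumes that only the first min(length) characters of each pair need to be compared.

```r
df <- data.frame(x = c("yhf", "rnmqjk", "wok"), y = c("yh", "rnmj", "ok"))

# Hypothetical helper: x and y are character vectors of single characters.
prefix_match_len <- function(x, y) {
  n <- min(length(x), length(y))          # compare only the shared length
  if (n == 0) return(0L)                  # an empty string matches nothing
  mismatch <- x[seq_len(n)] != y[seq_len(n)]
  # If nothing differs within the shared length, all n characters match.
  if (any(mismatch)) which.max(mismatch) - 1L else n
}

df$z <- mapply(prefix_match_len,
               strsplit(as.character(df$x), split = ""),
               strsplit(as.character(df$y), split = ""))
df
```

Because both vectors are cut to the same length before comparing, no recycling happens and no warning is raised. The else branch also returns the shared length when one string is a prefix of the other.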
As the OP notes in the comments, if there are instances where the entire word matches, then which.max returns 1, so z comes out as 0. To return the length of the string instead, I'd add a second line of code that combines logical subsetting with the nchar function:
df$z[as.character(df$x) == as.character(df$y)] <-
nchar(as.character(df$x[as.character(df$x) == as.character(df$y)]))
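As a quick check of that correction, here is a sketch on a made-up data frame (the row "same"/"same" is an assumption added for illustration) showing the value before and after the fix:

```r
# Made-up example data: the third row is an exact match.
df <- data.frame(x = c("yhf", "wok", "same"), y = c("yh", "ok", "same"))

# Same approach as above, with the length warnings silenced.
df$z <- suppressWarnings(
  mapply(function(x, y) which.max(x != y),
         strsplit(as.character(df$x), split = ""),
         strsplit(as.character(df$y), split = ""))
) - 1
z_before <- df$z   # the exact match in row 3 is reported as 0

# Overwrite exact matches with the full string length.
full <- as.character(df$x) == as.character(df$y)
df$z[full] <- nchar(as.character(df$x[full]))
df
```

Note that this correction only covers exact matches. If y is a strict prefix of x and the length of x is a multiple of the length of y (for example "abab" against "ab"), recycling makes every position compare equal, so the result is still 0 and would need separate handling.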