Partial String Matching by Row


I'm trying to add a column to a data frame that holds the number of characters two strings have in common, counted from the left side of both strings.

Each row has a comparison string, which we want to test against a user-supplied string. Given a data frame:

df <- data.frame(x=c("yhf", "rnmqjk", "wok"), y=c("yh", "rnmj", "ok"))

       x    y
1    yhf   yh
2 rnmqjk rnmj
3    wok   ok

Where x is our comparison string and y is our given string, I'm looking to have the values 2, 3, 0 output in column z, like so:

       x    y    z
1    yhf   yh    2
2 rnmqjk rnmj    3
3    wok   ok    0

Essentially, I want each given string (y) checked from left to right against its comparison string (x); as soon as the characters stop lining up, the rest of the string should not be checked and the number of matches so far should be recorded.

Thank you in advance!


Solution

  • This code works for your example:

    df$z <- mapply(function(x, y) which.max(x != y),
                   strsplit(as.character(df$x), split=""),
                   strsplit(as.character(df$y), split="")) - 1
    
    df
           x    y z
    1    yhf   yh 2
    2 rnmqjk rnmj 3
    3    wok   ok 0
    

    As an outline: strsplit splits a character vector into a list of character vectors. With the split="" argument, each element of those vectors is a single character. The which.max function returns the first position at which its argument reaches its maximum. Since the vectors returned by x != y are logical, which.max returns the first position where a difference is observed (the first TRUE). mapply takes a function and one or more lists and applies the function to corresponding elements of the lists. Subtracting 1 from the position of the first difference gives the number of matching characters before it.
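
To make the outline concrete, here is a small sketch that walks through row 2 of the example one step at a time. The intermediate names (x_chars, y_chars, mismatch, first_diff) are made up for illustration and do not appear in the answer's code; the warning is silenced only to keep the walk-through quiet.

```r
# Walk through row 2 of the example ("rnmqjk" vs "rnmj") step by step.
x_chars <- strsplit("rnmqjk", split = "")[[1]]  # "r" "n" "m" "q" "j" "k"
y_chars <- strsplit("rnmj", split = "")[[1]]    # "r" "n" "m" "j"

# y_chars is recycled to length 6 ("r" "n" "m" "j" "r" "n"); R warns because
# 6 is not a multiple of 4. The warning is silenced here for readability.
mismatch <- suppressWarnings(x_chars != y_chars)
# FALSE FALSE FALSE TRUE TRUE TRUE

first_diff <- which.max(mismatch)  # 4: position of the first TRUE
z <- first_diff - 1                # 3 leading characters agree
```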

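    Note that this produces warnings that the lengths of the strings don't match ("longer object length is not a multiple of shorter object length"), because R recycles the shorter vector in x != y. This could be addressed in a couple of ways; the easiest is to wrap the call in suppressWarnings if the messages bug you.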
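
One possible way to do that wrapping, as a sketch rather than the only option, is to put the whole mapply call inside suppressWarnings. The data frame is rebuilt here with stringsAsFactors = FALSE (an assumption, so the as.character calls can be dropped); keep them if your columns are factors.

```r
# Rebuild the example data with plain character columns (assumption:
# no factors, so as.character() is not needed below).
df <- data.frame(x = c("yhf", "rnmqjk", "wok"),
                 y = c("yh", "rnmj", "ok"),
                 stringsAsFactors = FALSE)

# Same computation as above, with the recycling warnings silenced.
df$z <- suppressWarnings(
  mapply(function(x, y) which.max(x != y),
         strsplit(df$x, split = ""),
         strsplit(df$y, split = ""))
) - 1
```

Bear in mind that suppressWarnings hides every warning raised inside the call, not only the recycling one.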


    As the OP notes in the comments, if the entire word matches, x != y is FALSE everywhere, so which.max returns 1 and z comes out as 0. To return the length of the string instead, I'd add a second line of code that combines logical subsetting with the nchar function:

    df$z[as.character(df$x) == as.character(df$y)] <-
                            nchar(as.character(df$x[as.character(df$x) == as.character(df$y)]))
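
The fix above covers identical strings. Recycling can also hide a mismatch when one string is a repeat of the other: for "abab" against "ab", the recycled comparison is FALSE everywhere, so the first approach gives 0 and the equality patch does not apply. As an alternative sketch (not part of the original answer), a hypothetical helper called common_prefix_len compares only the positions both strings have, which avoids recycling and the warnings altogether. The two extra rows in the data frame are added purely to illustrate those cases, and the columns are assumed to be character rather than factor.

```r
# Hypothetical helper (not from the answer above): counts how many leading
# characters two strings share, comparing only positions both strings have.
common_prefix_len <- function(x, y) {
  x_chars <- strsplit(x, split = "")[[1]]
  y_chars <- strsplit(y, split = "")[[1]]
  n <- min(length(x_chars), length(y_chars))
  if (n == 0) return(0L)
  mismatch <- x_chars[seq_len(n)] != y_chars[seq_len(n)]
  # If nothing differs within the overlap, the whole overlap matches.
  if (any(mismatch)) which.max(mismatch) - 1L else n
}

# Original three rows plus two illustrative rows (a repeated prefix and an
# exact match); assumes character columns, not factors.
df <- data.frame(x = c("yhf", "rnmqjk", "wok", "abab", "same"),
                 y = c("yh", "rnmj", "ok", "ab", "same"),
                 stringsAsFactors = FALSE)

df$z <- mapply(common_prefix_len, df$x, df$y, USE.NAMES = FALSE)
```

Because the comparison stops at the length of the shorter string, no second pass for fully matching strings is needed with this variant.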