
Getting positions of approximate substrings across two data frames in R


I have two data frames. The first (word.library) contains the strings that should approximately match substrings of the strings in the second data frame (tragetframe).

word.library <- data.frame(mainword = c("important word",
                                        "crazy sayings"))

tragetframe <- data.frame(words = c("Important Words",
                                    "I would also Importante worde of thes substring",
                                    "No mention of this crazy sayingsys"))

So far I have only worked out how to do this one pattern at a time (or with a loop), which does not meet my needs:

positions <- aregexec(word.library[1,1], tragetframe$words, max.distance = 0.1)

positions <- aregexec(word.library[2,1], tragetframe$words, max.distance = 0.1)

Finally: I'm looking for a way to do this for all strings in the column word.library$mainword at once. Does anyone have a good idea? Thanks.


Solution

  • find <- function(library.vec, frame.vec) {
      aregexec(library.vec, frame.vec, max.distance = 0.1)
    }
    

    If you wrap the expression you tried in a function, you can pass it to one of the apply-family functions to repeat it over every entry in the word library.

    mapply(find, word.library[,1], list(tragetframe[,1]))
    #     [,1] [,2]
    #[1,] 1    -1  
    #[2,] 14   -1  
    #[3,] -1   20 
    

    The attributes (such as match.length) are dropped in the process. The result has one row per target string and one column per library word; -1 means no approximate match was found. If you want to keep the attributes, try:

    lapply(word.library[,1], find, tragetframe[,1])
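If the list returned by lapply is awkward to work with, it can be flattened into one data frame with a row per (library word, target string) pair. The following is only a sketch under a few assumptions: the data frames hold character columns (stringsAsFactors = FALSE is set explicitly), the name match_table and the column names word, row, start, length and matched are made up for illustration, and aregexec is called directly instead of through the find wrapper.

```r
# Sample data from the question (identifiers kept as posted, including `tragetframe`).
word.library <- data.frame(mainword = c("important word", "crazy sayings"),
                           stringsAsFactors = FALSE)
tragetframe <- data.frame(words = c("Important Words",
                                    "I would also Importante worde of thes substring",
                                    "No mention of this crazy sayingsys"),
                          stringsAsFactors = FALSE)

# One aregexec() result per library word. Each result is a list with one
# element per target string, and each element keeps its "match.length" attribute.
matches <- lapply(word.library$mainword, aregexec,
                  text = tragetframe$words, max.distance = 0.1)
names(matches) <- word.library$mainword

# Flatten into a data frame (`match_table` is a hypothetical name, not base R).
match_table <- do.call(rbind, lapply(names(matches), function(w) {
  m <- matches[[w]]
  # Start position of the overall match; -1 when nothing matched.
  start <- vapply(m, function(x) as.integer(x[1]), integer(1))
  # Length of the overall match, read from the attribute when it is present.
  len <- vapply(m, function(x) {
    ml <- attr(x, "match.length")
    if (is.null(ml)) NA_integer_ else as.integer(ml[1])
  }, integer(1))
  data.frame(word    = w,
             row     = seq_along(m),
             start   = start,
             length  = len,
             # Matched substring of the target string, NA where there is no match.
             matched = ifelse(start > 0,
                              substring(tragetframe$words, start, start + len - 1),
                              NA_character_),
             stringsAsFactors = FALSE)
}))

# Keep only the pairs where an approximate match was found.
hits <- match_table[match_table$start > 0, ]
print(hits)
```

The long format makes it easy to filter or merge the positions back onto either data frame. For the matched text alone, regmatches(tragetframe$words, matches[[1]]) should also work, since the aregexec help page uses it on aregexec results.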