Search code examples
rstringrsqldfstringdistfuzzyjoin

Merging two data frame based on maximum numbers of words in commonin R


I have two data.frame one containing partial name and the other one containing full name as follow

partial <- data.frame( "partial.name" = c("Apple", "Apple", "WWF",
"wizz air", "WeMove.eu", "ILU")
full <- data.frame("full.name" = c("Apple Inc", "wizzair", "We Move Europe",
"World Wide Fundation (WWF)", "(ILU)", "Ilusion")

In the ideal world, I would love to have a table like this (my real partial df has 12 794 rows)

print(partial)
partial full
Apple   Apple Inc
Apple   Apple Inc
WWF World Wide Fundation (WWF)
wizz air wizzair
WeMove.eu We Move Europe
... 12 794 total rows

For every row without an answer I would like to be NA

I tried many things, fuzzyjoin with regex, regex_left_join even with the package sqldf. I have some results but I know it would be better if regex_left_join understand that I am looking for words I know in stringr , boundary( type = c("word")) exist but I do not know of to implement it.

For now, I just prepared the partial df, to get rid of the non-alphanumerical information and to make it lowercase.

partial$regex <- str_squish((str_replace_all(partial$partial.name, regex("\\W+"), " ")))
partial$regex <- tolower(partial$regex)

How can I match partial$partial.name with full$full.name based on the maximum number of words in common?


Solution

  • Partial string matching is time consuming to get right. I believe the Jaro-Winkler distance is a good candidate but you would need to spend time tweaking parameters. Here's an example to get you going.

    library(stringdist)
    
    partial <- data.frame( "partial.name" = c("Apple", "Apple", "WWF", "wizz air", "WeMove.eu", "ILU", 'None'), stringsAsFactors = F)
    full <- data.frame("full.name" = c("Apple Inc", "wizzair", "We Move Europe", "World Wide Foundation (WWF)", "(ILU)", "Ilusion"), stringsAsFactors = F)
    
    mydist <- function(partial, list_of_fulls, method='jw', p = 0, threshold = 0.4) {
        find_dist <- function(first, second, method = method, p = p) {
            stringdist(a = first, b = second, method = method, p = p)
        }
        distances <- unlist(lapply(list_of_fulls, function(full) find_dist(first = full, second = partial, method = method, p = p)))
        # If the distance is too great assume NA 
        if (min(distances) > threshold) {
            NA
        } else {
            closest_index <- which.min(distances)
            list_of_fulls[closest_index]
        }
    }
    
    partial$match <- unlist(lapply(partial$partial.name, function(partial) mydist(partial = partial, list_of_fulls = full$full.name, method = 'jw')))
    
    partial
    #  partial.name                       match
    #1        Apple                   Apple Inc
    #2        Apple                   Apple Inc
    #3          WWF World Wide Foundation (WWF)
    #4     wizz air                     wizzair
    #5    WeMove.eu              We Move Europe
    #6          ILU                       (ILU)
    #7         None                        <NA>