I have two data.frame one containing partial name and the other one containing full name as follow
partial <- data.frame( "partial.name" = c("Apple", "Apple", "WWF",
"wizz air", "WeMove.eu", "ILU")
full <- data.frame("full.name" = c("Apple Inc", "wizzair", "We Move Europe",
"World Wide Fundation (WWF)", "(ILU)", "Ilusion")
In the ideal world, I would love to have a table like this (my real partial df has 12 794 rows)
print(partial)
partial full
Apple Apple Inc
Apple Apple Inc
WWF World Wide Fundation (WWF)
wizz air wizzair
WeMove.eu We Move Europe
... 12 794 total rows
For every row without an answer I would like to be NA
I tried many things, fuzzyjoin
with regex
, regex_left_join
even with the package sqldf
. I have some results but I know it would be better if regex_left_join
understand that I am looking for words I know in stringr
, boundary( type = c("word"))
exist but I do not know of to implement it.
For now, I just prepared the partial df, to get rid of the non-alphanumerical information and to make it lowercase.
partial$regex <- str_squish((str_replace_all(partial$partial.name, regex("\\W+"), " ")))
partial$regex <- tolower(partial$regex)
How can I match partial$partial.name
with full$full.name
based on the maximum number of words in common?
Partial string matching is time consuming to get right. I believe the Jaro-Winkler distance is a good candidate but you would need to spend time tweaking parameters. Here's an example to get you going.
library(stringdist)
partial <- data.frame( "partial.name" = c("Apple", "Apple", "WWF", "wizz air", "WeMove.eu", "ILU", 'None'), stringsAsFactors = F)
full <- data.frame("full.name" = c("Apple Inc", "wizzair", "We Move Europe", "World Wide Foundation (WWF)", "(ILU)", "Ilusion"), stringsAsFactors = F)
mydist <- function(partial, list_of_fulls, method='jw', p = 0, threshold = 0.4) {
find_dist <- function(first, second, method = method, p = p) {
stringdist(a = first, b = second, method = method, p = p)
}
distances <- unlist(lapply(list_of_fulls, function(full) find_dist(first = full, second = partial, method = method, p = p)))
# If the distance is too great assume NA
if (min(distances) > threshold) {
NA
} else {
closest_index <- which.min(distances)
list_of_fulls[closest_index]
}
}
partial$match <- unlist(lapply(partial$partial.name, function(partial) mydist(partial = partial, list_of_fulls = full$full.name, method = 'jw')))
partial
# partial.name match
#1 Apple Apple Inc
#2 Apple Apple Inc
#3 WWF World Wide Foundation (WWF)
#4 wizz air wizzair
#5 WeMove.eu We Move Europe
#6 ILU (ILU)
#7 None <NA>