I am trying to get the nearest matching string, along with its score, using the "stringdist" package with method = "jw" (Jaro-Winkler).
The first data frame (df_1) consists of 2 columns. For each string in str_1, I want to get the nearest string from str_2 in df_2 and the score for that match.
I have gone through the package documentation and found a partial solution, which I show further below. Here is the data:
year = c(2001,2001,2002,2003,2005,2006)
str_1 =c("The best ever Puma wishlist","I finalised on buy a top from Myntra","Its perfect for a day at gym",
"Check out PUMA Unisex Black Running","i have been mailing my issue daily","xyz")
df_1 = data.frame(year,str_1)
ID = c(100,211,155,367,678,2356,927,829,397)
str_2 = c("VeRy4G3c7X","i have been mailing my issue twice","I finalised on buy a top from jobong",
"Qm9dZzJdms","Check out PUMA Unisex Black Running","The best Puma wishlist","Its a day at gym",
"fOpBRWCdSh","")
df_2 = data.frame(ID,str_2)
I need to get the nearest match from the str_2 column of df_2 using stringdist(a, b, method = c("jw")), and the final table would look like this:
df_1$Nearest_matching = c("The best Puma wishlist", "I finalised on buy a top from jobong", "Its a day at gym",
"Check out PUMA Unisex Black Running", "i have been mailing my issue twice", NA)
df_1$Nearest_matching_score = c(0.099, 0.092, 0.205, 0, 0.078, NA)
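To make the meaning of the score concrete: stringdist returns a distance, so 0 means the strings are identical and larger values mean a worse match. The sketch below is only an illustration and assumes the stringdist package is installed. The names d_same, d_jaro and d_jw are made up for this example. As far as I understand the documentation, with the default p = 0 the "jw" method gives the plain Jaro distance, and the Winkler prefix adjustment only applies when p > 0.

```r
library(stringdist)

a <- "The best ever Puma wishlist"
b <- "The best Puma wishlist"

# Identical strings: the distance is 0
d_same <- stringdist(b, b, method = "jw")

# Default p = 0: plain Jaro distance
d_jaro <- stringdist(a, b, method = "jw")

# p > 0 rewards a shared prefix, so the distance can only shrink
d_jw <- stringdist(a, b, method = "jw", p = 0.1)

print(c(same = d_same, jaro = d_jaro, winkler = d_jw))
```

The value of p = 0.1 is an arbitrary choice for illustration; the scores in my expected table were computed with the default.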
Here is what I came up with based on the documentation of the stringdist package:
First I created a distance matrix between str_1 and str_2, then I assigned column names to it like this:
nearest_matching <- stringdistmatrix(df_1$str_1,df_2$str_2, method = "jw")
colnames(nearest_matching) <- str_2
Then I selected the smallest value (distance) from each row.
apply(nearest_matching, 1, FUN = min)
output:
> apply(nearest_matching, 1, FUN = min)
[1] 0.09960718 0.09259259 0.20535714 0.00000000 0.07843137 0.52222222
Finally, I wrote out the column names corresponding to these values:
colnames(nearest_matching)[apply(nearest_matching, 1, FUN = which.min)]
output:
> colnames(nearest_matching)[apply(nearest_matching, 1, FUN = which.min)]
[1] "The best Puma wishlist" "I finalised on buy a top from jobong" "Its a day at gym"
[4] "Check out PUMA Unisex Black Running" "i have been mailing my issue twice" "VeRy4G3c7X"
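This differs from the expected table in the last row: "xyz" gets "VeRy4G3c7X" with a distance of about 0.52, where I want NA. One possible way to get there is to apply a cutoff and blank out matches that are too far away. The sketch below is a rough attempt under assumptions, not a confirmed solution: the cutoff max_dist = 0.3 is my own guess (any value between the worst accepted score, 0.205, and 0.522 would behave the same on this data), and the names d, best_idx, best_score and keep are illustrative.

```r
library(stringdist)

year  <- c(2001, 2001, 2002, 2003, 2005, 2006)
str_1 <- c("The best ever Puma wishlist", "I finalised on buy a top from Myntra",
           "Its perfect for a day at gym", "Check out PUMA Unisex Black Running",
           "i have been mailing my issue daily", "xyz")
df_1  <- data.frame(year, str_1, stringsAsFactors = FALSE)

str_2 <- c("VeRy4G3c7X", "i have been mailing my issue twice",
           "I finalised on buy a top from jobong", "Qm9dZzJdms",
           "Check out PUMA Unisex Black Running", "The best Puma wishlist",
           "Its a day at gym", "fOpBRWCdSh", "")

max_dist <- 0.3  # assumed cutoff, not taken from the package

# Rows correspond to str_1, columns to str_2
d <- stringdistmatrix(df_1$str_1, str_2, method = "jw")

# Column of the smallest distance per row, and that distance itself
best_idx   <- apply(d, 1, which.min)
best_score <- d[cbind(seq_len(nrow(d)), best_idx)]

# Keep a match only if it is within the cutoff, otherwise NA
keep <- best_score <= max_dist
df_1$Nearest_matching       <- ifelse(keep, str_2[best_idx], NA_character_)
df_1$Nearest_matching_score <- ifelse(keep, best_score, NA_real_)

print(df_1)
```

The package also has amatch(), which takes a maxDist argument and returns NA when nothing is close enough, so amatch(df_1$str_1, str_2, method = "jw", maxDist = 0.3) may give the same indices with less code, though it does not return the scores.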