How to get the nearest matching string, along with its score, from a column in another table?


I am trying to get the nearest matching string, along with its score, using the "stringdist" package with method = "jw" (Jaro-Winkler).

The first data frame (df_1) has two columns. For each string in str_1, I want the nearest string from str_2 in df_2, together with the score for that match.
I have gone through the package documentation; the example data and what I have found so far are below:

    year = c(2001,2001,2002,2003,2005,2006)
    str_1 =c("The best ever Puma wishlist","I finalised on buy a top from Myntra","Its perfect for a day at gym",
             "Check out PUMA Unisex Black Running","i have been mailing my issue daily","xyz")
    
    df_1 = data.frame(year,str_1)
    
    ID = c(100,211,155,367,678,2356,927,829,397)
    str_2 = c("VeRy4G3c7X","i have been mailing my issue twice","I finalised on buy a top from jobong",
              "Qm9dZzJdms","Check out PUMA Unisex Black Running","The best Puma wishlist","Its a day at gym",
              "fOpBRWCdSh","")

    df_2 = data.frame(ID,str_2)

I need the nearest match from the str_2 column of df_2, computed with:

    stringdist(a, b, method = "jw")

The final table should look like this:

    df_1$Nearest_matching = c("The best Puma wishlist","I finalised on buy a top from jobong","Its a day at gym",
                              "Check out PUMA Unisex Black Running","i have been mailing my issue twice",NA)
    df_1$Nearest_matching_score = c(0.099,0.092,0.205,0,0.078,NA)

Solution

  • Here is what I came up with, based on the documentation of the stringdist package:

    First I created a distance matrix between str_1 and str_2 (rows are str_1, columns are str_2), then I assigned str_2 as its column names:

    library(stringdist)
    # Jaro-Winkler distances: one row per str_1 entry, one column per str_2 entry
    nearest_matching <- stringdistmatrix(df_1$str_1,df_2$str_2,  method = "jw")
    colnames(nearest_matching) <- df_2$str_2
    

    Then I selected the smallest value (distance) from each row.

    apply(nearest_matching, 1, FUN = min)
    

    output:

    > apply(nearest_matching, 1, FUN = min)
    [1] 0.09960718 0.09259259 0.20535714 0.00000000 0.07843137 0.52222222
    

    Finally, I wrote out the column names corresponding to these values:

    colnames(nearest_matching)[apply(nearest_matching, 1, FUN = which.min)]
    

    output:

    > colnames(nearest_matching)[apply(nearest_matching, 1, FUN = which.min)]
    [1] "The best Puma wishlist"               "I finalised on buy a top from jobong" "Its a day at gym"                    
    [4] "Check out PUMA Unisex Black Running"  "i have been mailing my issue twice"   "VeRy4G3c7X"
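
The steps above always return some match, so "xyz" is paired with "VeRy4G3c7X" at a distance of about 0.52, whereas the question expects NA for that row. One possible way to get there is to discard matches whose distance exceeds a cutoff. The sketch below is a minimal illustration, not part of the original answer: the cutoff `max_dist = 0.3` and the helper names `best_idx`, `best_score` and `keep` are assumptions chosen for this example, and the cutoff would need tuning on real data.

```r
library(stringdist)

# Data from the question; stringsAsFactors = FALSE keeps strings as character in R < 4.0
df_1 <- data.frame(
  year  = c(2001, 2001, 2002, 2003, 2005, 2006),
  str_1 = c("The best ever Puma wishlist", "I finalised on buy a top from Myntra",
            "Its perfect for a day at gym", "Check out PUMA Unisex Black Running",
            "i have been mailing my issue daily", "xyz"),
  stringsAsFactors = FALSE
)
df_2 <- data.frame(
  ID    = c(100, 211, 155, 367, 678, 2356, 927, 829, 397),
  str_2 = c("VeRy4G3c7X", "i have been mailing my issue twice",
            "I finalised on buy a top from jobong", "Qm9dZzJdms",
            "Check out PUMA Unisex Black Running", "The best Puma wishlist",
            "Its a day at gym", "fOpBRWCdSh", ""),
  stringsAsFactors = FALSE
)

# Jaro-Winkler distance matrix: rows follow str_1, columns follow str_2
d <- stringdistmatrix(df_1$str_1, df_2$str_2, method = "jw")

# Column index of the closest str_2 for each row, and the distance at that position
best_idx   <- apply(d, 1, which.min)
best_score <- d[cbind(seq_len(nrow(d)), best_idx)]

# Assumed cutoff (hypothetical value): larger distances are treated as "no match"
max_dist <- 0.3
keep <- best_score <= max_dist

df_1$Nearest_matching       <- ifelse(keep, df_2$str_2[best_idx], NA_character_)
df_1$Nearest_matching_score <- ifelse(keep, round(best_score, 3), NA_real_)

print(df_1)
```

With this cutoff, the first five rows keep the matches shown in the output above (their distances are all at most about 0.205), while the "xyz" row becomes NA in both new columns. The stringdist package also provides `amatch()`, which takes a `maxDist` argument and returns the index of the closest match within that distance, so it may be worth a look for the lookup step.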