r, string, edit-distance, stringdist

String distance metric that favors substrings and is independent of word order?


For my data analytics problem, I often need to normalize names. Given two names A and B, I'd consider them the same or very similar if they share a substantial number of common substrings, regardless of the order of those substrings.

For example, given "COLD" and c("FLOOD", "COLD/WIND CHILL"), I'd like "COLD/WIND CHILL" to be rated as much more similar to "COLD" than "FLOOD" is.

My current assignment is in R. So my concrete questions are the following:

  1. Is such a metric already defined in R?

  2. Is it possible to provide my own implementation and somehow integrate it with R's stringdist package?

For my requirement, a simple regular-expression search would be enough: as long as I can find A in B or B in A, I can consider their distance to be 0.
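A minimal sketch of that containment rule in base R, assuming a literal (non-regex) substring test is acceptable. The helper name containment_dist is hypothetical, and falling back to adist when neither string contains the other is an assumption, not something required by the rule itself:

```r
# Hypothetical helper: distance 0 when either string contains the other,
# otherwise fall back to the plain Levenshtein distance from adist().
containment_dist <- function(a, b) {
  stopifnot(length(a) == 1L)
  vapply(b, function(s) {
    # fixed = TRUE treats the pattern as a literal string, not a regex
    if (grepl(a, s, fixed = TRUE) || grepl(s, a, fixed = TRUE)) {
      0
    } else {
      as.numeric(adist(a, s))
    }
  }, numeric(1), USE.NAMES = FALSE)
}

res <- containment_dist("COLD", c("FLOOD", "COLD/WIND CHILL"))
res
```

With this rule "COLD/WIND CHILL" gets distance 0 because it contains "COLD", while "FLOOD" keeps its ordinary edit distance.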

Thanks a lot!

Edit:

In the context of the following:

> vv <- c("FLOOD", "COLD/WIND CHILL")
> sapply(vv, adist, y = "COLD")
          FLOOD COLD/WIND CHILL 
              3              11 

I would like the distance from "COLD" to "COLD/WIND CHILL" to be smaller than the distance from "COLD" to "FLOOD".

It seems that, once the matching substring has been found, the metric has to ignore the remaining part that would otherwise need to be deleted.

Edit1:

My original problem has been solved. Here is a follow-up on a related problem with amatch from the stringdist package in R:

With amatch, I was not able to reproduce results equivalent to those from adist, or even from stringdist in the same package.

Below is the illustration:

vv <- c("FLOOD", "COLD/WIND CHILL")
sapply(vv, adist, y = "COLD",costs=list(deletions=0))
          FLOOD COLD/WIND CHILL 
              2               0 

stringdist("COLD", c("FLOOD", " COLD/WIND CHILL"), method = 'lv', weight=c(0.001, 0.99, 0.99, 0.99))
[1] 1.981 1.002

amatch("COLD", c("FLOOD", " COLD/WIND CHILL"), method = 'lv', weight=c(0.0001, 0.999, 0.999, 0.999), maxDist = 100)
[1] 1

Given the distances that stringdist computes above, amatch should return 2 (the index of the closer candidate) instead of 1.

According to the stringdist documentation:

"weight:
For method='osa' or 'dl', the penalty for deletion, insertion, substitution and transposition, in that order. When method='lv', the penalty for transposition is ignored. "

I chose the weights accordingly, to make deletion nearly free while maximizing the penalty for the other operations. It's encouraging that stringdist shows the expected behavior with these weights.

I'd assume that amatch uses stringdist to do the calculation, so it seems strange that the behavior of amatch contradicts that of stringdist!

I'd like to get amatch working so that I don't have to re-implement it using adist or stringdist.

Thanks again for your help.


Solution

  • You can use adist for fuzzy matching. It computes a generalized Levenshtein (edit) distance.

    vv <- c("COLD","FLOOD")
    sapply(vv,adist,y="COLD/WIND CHILL")
    ## COLD FLOOD   
    ##  11    13    ## the distance to COLD < distance to FLOOD
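If the aim is for a candidate that contains the query to count as a perfect match, the partial argument of adist may be worth a look. The sketch below assumes the query is the first argument and the candidates the second, so that the query is compared against substrings of each candidate. The variable names are illustrative only:

```r
query <- "COLD"
candidates <- c("FLOOD", "COLD/WIND CHILL")

# partial = TRUE compares the query against substrings of each candidate
# rather than against the whole candidate string.
d <- adist(query, candidates, partial = TRUE)

# d is a 1 x 2 matrix: one row for the query, one column per candidate.
d
```

Since "COLD" occurs verbatim in "COLD/WIND CHILL", that column should be 0, while the "FLOOD" column stays positive.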
    

    Edit after the OP's update:

    You can adjust the costs parameter to control how the distance is computed in terms of deletions, substitutions and insertions. For example, with free deletions:

    vv <- c("FLOOD", "COLD/WIND CHILL")
    sapply(vv, adist, y = "COLD", costs = list(deletions = 0))
    ##           FLOOD COLD/WIND CHILL 
    ##               2               0
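To get an amatch-like "index of the closest candidate" from this, one option is to combine adist with which.min. This is a base R sketch, not a description of how amatch works internally, and the names candidates, query and best are illustrative:

```r
candidates <- c("FLOOD", "COLD/WIND CHILL")
query <- "COLD"

# Rows correspond to the candidates, the single column to the query.
# Deleting characters from a candidate costs nothing; insertions and
# substitutions keep their default cost of 1.
d <- adist(candidates, query, costs = list(deletions = 0))

# Index of the candidate with the smallest distance to the query.
best <- which.min(d[, 1])
candidates[best]
```

Note that which.min returns the first minimum, so ties are resolved in favor of the earlier candidate.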