For my data analytics work, I often need to normalize names: I'd consider two names A and B the same or very similar if they share a substantial number of common substrings, regardless of the order of those substrings.
For example, given "COLD" and c("FLOOD", "COLD/WIND CHILL"), I'd like "COLD/WIND CHILL" to come out as much more similar to "COLD" than "FLOOD" is.
My current assignment is in R. So my concrete questions are the following:
Is such a metric already defined in R?
Is it possible to provide my own implementation and somehow integrate it with R's stringdist package?
For my requirement, I could simply use a regular expression search: as long as I can find A in B or B in A, I can just consider their distance to be 0.
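A minimal sketch of that containment rule in base R, assuming a fallback to plain adist when neither string contains the other (the helper name containment_dist is hypothetical, not an existing function):

```r
# Hypothetical helper: distance 0 when one string contains the other,
# otherwise fall back to the generalized Levenshtein distance from adist().
containment_dist <- function(a, b) {
  # fixed = TRUE treats the pattern as a literal string, so characters
  # such as "." or "(" in a name are not interpreted as regex syntax.
  contained <- grepl(a, b, fixed = TRUE) || grepl(b, a, fixed = TRUE)
  if (contained) 0 else as.numeric(adist(a, b))
}

vv <- c("FLOOD", "COLD/WIND CHILL")
sapply(vv, containment_dist, b = "COLD")
```

With these inputs, "COLD/WIND CHILL" gets distance 0 because it contains "COLD", while "FLOOD" keeps its ordinary edit distance.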
Thanks a lot!
Edit:
In the context of the following:
> vv <- c("FLOOD", "COLD/WIND CHILL")
> sapply(vv, adist, y = "COLD")
          FLOOD COLD/WIND CHILL
              3              11
I wish the distance from "COLD" to "COLD/WIND CHILL" were smaller than the distance from "COLD" to "FLOOD".
It seems that the metric has to ignore the remaining part that would be deleted, once the matched substring has been found.
Edit1:
My original problem has been solved. Here is a follow-up on a related problem with using amatch from stringdist in R:
It seems that with amatch I cannot reproduce results equivalent to those from adist, or even from stringdist in the same package.
Below is the illustration:
vv <- c("FLOOD", "COLD/WIND CHILL")
sapply(vv, adist, y = "COLD",costs=list(deletions=0))
          FLOOD COLD/WIND CHILL
              2               0
stringdist("COLD", c("FLOOD", " COLD/WIND CHILL"), method = 'lv', weight=c(0.001, 0.99, 0.99, 0.99))
[1] 1.981 1.002
amatch("COLD", c("FLOOD", " COLD/WIND CHILL"), method = 'lv', weight=c(0.0001, 0.999, 0.999, 0.999), maxDist = 100)
[1] 1
In the above context, going by the computation of stringdist, amatch should return 2 instead of 1.
Based on the documentation of stringdist:
"weight: For method='osa' or 'dl', the penalty for deletion, insertion, substitution and transposition, in that order. When method='lv', the penalty for transposition is ignored."
I have chosen the weights accordingly to remove the penalty for deletion while maximizing the penalty for the other operations. It's encouraging that stringdist shows the expected behavior with these weight settings.
I'd assume that amatch would use stringdist to do the calculation, but strangely the behavior of amatch contradicts the behavior of stringdist!
I wish to get amatch working so that I don't have to re-implement it using adist or stringdist.
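For reference, the fallback I'd like to avoid could look roughly like this sketch, which picks the candidate with the smallest adist value. The name closest_match is hypothetical, and the sketch assumes ties are resolved by taking the first minimum:

```r
# Hypothetical stand-in for amatch(), built on base R's adist().
# Returns the index of the candidate in `table` closest to `x`.
closest_match <- function(x, table, costs = list(deletions = 0)) {
  # adist(table, x) transforms each candidate into x, so with free
  # deletions the extra characters of a longer candidate cost nothing.
  d <- adist(table, x, costs = costs)  # one row per candidate
  which.min(d[, 1])                    # first minimum wins on ties
}

vv <- c("FLOOD", "COLD/WIND CHILL")
closest_match("COLD", vv)
```

Given the adist distances shown above (2 for "FLOOD", 0 for "COLD/WIND CHILL"), this selects the second candidate.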
Thanks for help again.
You can use adist for fuzzy distance. The distance is a generalized Levenshtein distance.
vv <- c("COLD","FLOOD")
sapply(vv,adist,y="COLD/WIND CHILL")
## COLD FLOOD
## 11 13 ## the distance to COLD < distance to FLOOD
You can play with the costs parameter to set how you want the distance to be computed in terms of deletions, substitutions, and insertions. Here, for example:
vv <- c("FLOOD", "COLD/WIND CHILL")   # the candidates from the question
sapply(vv, adist, y = "COLD", costs = list(deletions = 0))
##           FLOOD COLD/WIND CHILL
##               2               0
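Another option that may fit the "ignore the remainder" requirement is adist's partial argument, which matches the first argument against substrings of the second, as agrep does. A minimal sketch, assuming the short name is passed as x and the longer candidates as y:

```r
vv <- c("FLOOD", "COLD/WIND CHILL")

# partial = TRUE: "COLD" only has to match some substring of each candidate,
# so the surrounding characters of a longer candidate are not penalized.
d <- adist("COLD", vv, partial = TRUE)
d

# Candidate closest to "COLD" under this partial distance
vv[which.min(d)]
```

Here "COLD/WIND CHILL" should come out at distance 0 since it contains "COLD" verbatim, while "FLOOD" gets a positive distance.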