If I have one vector of names, say:
a = c("tom", "tommy", "alex", "tom", "alexis", "Alex", "jenny", "Al", "michell")
I want to use levenshteinSim or something similar to get similarity scores within this vector. However, I don't want any element to be scored against itself. For example, the first "tom" should be scored against the second "tom", but no score should be returned for the first "tom" against itself.
I have done this previously with two different vectors, a and b. However, if I use the same vector for both, then the first "tom" is scored against itself, which is what I want to avoid.
Is there a way to do this?
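You can use combn to generate all unordered pairs of elements of a. Each pair is built from two different positions in the vector, so no element is ever paired with itself, while duplicated values such as the two "tom" entries are still compared with each other: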
a <- c("tom", "tommy", "alex", "tom", "alexis", "Alex", "jenny", "Al", "michell")
df <- data.frame(t(combn(a, 2)), stringsAsFactors = FALSE)
df$sim <- RecordLinkage::levenshteinSim(df$X1, df$X2)
head(df)
#    X1     X2 sim
# 1 tom  tommy 0.6
# 2 tom   alex 0.0
# 3 tom    tom 1.0
# 4 tom alexis 0.0
# 5 tom   Alex 0.0
# 6 tom  jenny 0.0
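
If you also need to know which positions each score refers to (for instance, to tell the two "tom" entries apart), one option is to run combn over the indices instead of the values. The following is only a sketch: the names idx, pairs, d and len are made up for illustration, and it uses base R's adist as a stand-in for RecordLinkage::levenshteinSim, on the assumption that the similarity is 1 minus the Levenshtein distance divided by the length of the longer string.

```r
a <- c("tom", "tommy", "alex", "tom", "alexis", "Alex", "jenny", "Al", "michell")

# All unordered index pairs with i < j, so no position is paired with itself.
idx <- t(combn(seq_along(a), 2))

pairs <- data.frame(
  i      = idx[, 1],
  j      = idx[, 2],
  name_i = a[idx[, 1]],
  name_j = a[idx[, 2]],
  stringsAsFactors = FALSE
)

# Assumed stand-in for levenshteinSim: 1 - distance / length of the longer string.
d   <- adist(a)   # full matrix of Levenshtein distances (case-sensitive by default)
len <- nchar(a)
pairs$sim <- 1 - d[idx] / pmax(len[pairs$i], len[pairs$j])

# Show the most similar distinct pairs first.
head(pairs[order(-pairs$sim), ])
```

Keeping i and j in the result makes it easy to map a score back to the original rows, and swapping the sim line for RecordLinkage::levenshteinSim(pairs$name_i, pairs$name_j) should work the same way if you prefer that package.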