RecordLinkage in R: scoring one vector against itself without self-matches


If I have one vector of names, say:

a = c("tom", "tommy", "alex", "tom", "alexis", "Alex", "jenny", "Al", "michell")

I want to use levenshteinSim (or something similar) to get similarity scores within this vector, but without self-scoring. For example, "tom" #1 should be scored against "tom" #3, but no score should be returned for "tom" #1 against "tom" #1.

I have done this before with two different vectors, a and b. However, if I pass the same vector as both arguments, "tom" #1 is scored against "tom" #1, which is what I want to avoid.

Is there a way to do this?


Solution

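  • You can use combn to generate all unordered pairs of elements of a. Pairs are formed by position, so each element is compared with every other element exactly once and never with itself: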

    a <- c("tom", "tommy", "alex", "tom", "alexis", "Alex", "jenny", "Al", "michell")
    
    df <- data.frame(t(combn(a, 2)), stringsAsFactors = FALSE)
    df$sim <- RecordLinkage::levenshteinSim(df$X1, df$X2)
    
    head(df)
    #    X1     X2 sim
    # 1 tom  tommy 0.6
    # 2 tom   alex 0.0
    # 3 tom    tom 1.0
    # 4 tom alexis 0.0
    # 5 tom   Alex 0.0
    # 6 tom  jenny 0.0
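
If you also need to know which "tom" is which, one possible variation is to run combn over the positions rather than the values. The sketch below is an assumption-laden alternative, not part of the original answer: it uses base R's utils::adist for the Levenshtein distance and normalizes by the longer string's length, which is how levenshteinSim is documented to compute its score. The names idx, pairs, i, j, x and y are made up for illustration.

```r
a <- c("tom", "tommy", "alex", "tom", "alexis", "Alex", "jenny", "Al", "michell")

# All unordered pairs of positions (i < j), so no element is paired with itself
idx <- t(combn(seq_along(a), 2))

pairs <- data.frame(
  i = idx[, 1],          # position of the first element
  j = idx[, 2],          # position of the second element
  x = a[idx[, 1]],
  y = a[idx[, 2]],
  stringsAsFactors = FALSE
)

# Full Levenshtein distance matrix, then pick out the (i, j) cells
d <- utils::adist(a)[idx]

# Assumed normalization: 1 - distance / length of the longer string
pairs$sim <- 1 - d / pmax(nchar(pairs$x), nchar(pairs$y))

# Highest-scoring pairs first
head(pairs[order(-pairs$sim), ])
```

With the positions kept in i and j, the row for "tom" #1 against "tom" #4 can be told apart from any other "tom" comparison, and filtering on pairs$i or pairs$j gives every score involving one particular element.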