Search code examples
rdplyrcosine-similaritystringdistrecord-linkage

Computing similarity % in text strings by excluding the identical entries in R


the given R script computes the similarity in % between two names as shown in the visual. Here we have two columns "names1" and "names2" with their respective ids in id1 and id2. My requirement is that when we execute the script, each name in "names1" gets compared with each name in "names2" column, I do not want the same entry i.e. (id1,names1) column to be compared with its identical entry in (id2,names2) column. For Illustration, the first (id1,names1) entry (1,Prabhudev Ramanujam) should get compared with all (id2,names2) but not with the first (id2,names2) entry. Similarly for all pairs. Also, if the formula

percent(sapply(names1, function(i)RecordLinkage::levenshteinSim(i,names2))) 

can be tweaked to produce a similar and faster result here as it slows down on large data, Attaching the snapshot, please help.

library(stringdist)
library(RecordLinkage)
library(dplyr)
library(scales)
id1    <- 1:8 
names1 <- c("Prabhudev Ramanujam","Deepak Subramaniam","Sangamer 
Mahapatra","SriramKishore Sharma",
        "Deepak Subramaniam","SriramKishore Sharma","Deepak 
Subramaniam","Sangamer Mahapatra")
id2    <- c(1,2,3,4,11,13,9,10)
names2 <- c("Prabhudev Ramanujam","Deepak Subramaniam","Sangamer 
Mahapatra","SriramKishore Sharma",
        "Deepak Subramaniam","Sangamer Mahapatra","SriramKishore 
Sharma","Deepak Subramaniam")
Name_Data <- data.frame(id1,names1,id2,names2)
Percent<- percent(sapply(names1, function(i) 
RecordLinkage::levenshteinSim(i,names2)))
Total_Value <- data.frame(id2,names2,Percent)

Snapshot visual


Solution

  • Not much faster, but my suggestion would be:

    percent(unlist(lapply(1:length(names1), function(x) {
      levenshteinSim(names1[x], names2[!(names2==names1[x] & id2==id1[x])])})))
    

    Edit:

    Alternatively, this might be faster - I guess it varies:

    as.vector(t(1 - (stringdistmatrix(names1, names2, method = "lv") / 
             outer(nchar(names1), nchar(names2), pmax))))[unlist(lapply(1:length(names1), function(x) !(names2==names1[x] & id2==id1[x])))]