Search code examples
pythonralgorithmsimilaritycosine-similarity

How to find similarity between two row data


I try to remove one of almost duplicated data from data set programmatically. My data set is logically like the table below.As you see there are two rows in data set and a human can understand easily that this two data are related and probably added by same person.

enter image description here

My solution to this problem is using Levenshtein to compare fields(name,address,telephone number) separately and find their similarity ratio. Then I calculate average ratio as 0.77873. This similarity results seems to low. My python code is like

from Levenshtein import ratio
name     =  ratio("Game of ThOnes Books for selling","Selling Game of Thrones books")
address  =  ratio("George Washington street","George Washington st.")
phone    =  ratio("555-55-55","0(555)-55-55")

total_ratio = name+address+phone
print total_ratio/3 #Average ratio

My question is what is the best way two compare row data? Which algorithms or methodologies is require for doing this?


Solution

  • We can compute distance matrix between rows, form clusters and select cluster members as candidates for similar rows.

    Using R and stringdistmatrix function from stringdist package allows distance computation between string inputs.

    The distance methods supported by stringdist are as below. See package manual for more details

    #Method name;   Description
    #osa    ; Optimal string aligment, (restricted Damerau-Levenshtein distance).
    #lv ; Levenshtein distance (as in R's native adist).
    #dl ; Full Damerau-Levenshtein distance.
    #hamming    ; Hamming distance (a and b must have same nr of characters).
    #lcs    ; Longest common substring distance.
    #qgram  ;q-gram distance.
    #cosine ; cosine distance between q-gram profiles
    #jaccard    ; Jaccard distance between q-gram profiles
    #jw ; Jaro, or Jaro-Winker distance.
    #soundex    ; Distance based on soundex encoding (see below)
    

    Data:

    library("stringdist")
    
    #have modified the data slightly to include dissimilar datapoints
    Date = c("07-Jan-17","06-Feb-17","03-Mar-17")
    name     =  c("Game of ThOnes Books for selling","Selling Game of Thrones books","Harry Potter BlueRay")
    address  =  c("George Washington street","George Washington st.","Central Avenue")
    phone    =  c("555-55-55","0(555)-55-55","111-222-333")
    DF = data.frame(Date,name,address,phone,stringsAsFactors=FALSE)
    
    DF
    #       Date                             name                  address        phone
    #1 07-Jan-17 Game of ThOnes Books for selling George Washington street    555-55-55
    #2 06-Feb-17    Selling Game of Thrones books    George Washington st. 0(555)-55-55
    #3 03-Mar-17             Harry Potter BlueRay           Central Avenue  111-222-333
    

    Hierarchical Clustering:

    rowLabels = sapply(DF[,"name"],function(x) paste0(head(unlist(strsplit(x," ")),2),collapse="_" ) )
    
    #create string distance matrix, hierarchical cluter object and corresponding plot
    nameDist = stringdistmatrix(DF[,"name"])
    nameHC = hclust(nameDist)
    
    plot(nameHC,labels = rowLabels ,main="HC plot : name")
    

    enter image description here

    addressDist = stringdistmatrix(DF[,"address"])
    addressDistHC = hclust(addressDist)
    
    plot(addressDistHC ,labels = rowLabels, main="HC plot : address")
    

    enter image description here

    phoneDist = stringdistmatrix(DF[,"phone"])
    phoneHC = hclust(phoneDist)
    
    plot(phoneHC ,labels = rowLabels, main="HC plot : phone" )
    

    enter image description here

    Similar Rows:

    The rows consistently form two clusters in this dataset, to identify members of the clusters we can do

    clusterDF = data.frame(sapply(DF[,-1],function(x) cutree(hclust(stringdistmatrix(x)),2) ))
    clusterDF$rowSummary = rowSums(clusterDF)
    
    clusterDF
    #  name address phone rowSummary
    #1    1       1     1          3
    #2    1       1     1          3
    #3    2       2     2          6
    
    
    #row frequency
    
    rowFreq  = table(clusterDF$rowSummary)
    #3 6 
    #2 1
    
    #we filter rows with frequency > 1
    similarRowValues  = as.numeric(names(which(rowFreq>1)))
    
    
    DF[clusterDF$rowSummary == similarRowValues,]
    #       Date                             name                  address        phone
    #1 07-Jan-17 Game of ThOnes Books for selling George Washington street    555-55-55
    #2 06-Feb-17    Selling Game of Thrones books    George Washington st. 0(555)-55-55
    

    This demo has worked well with the simple/toy dataset but on real dataset you would have to tinker with string distance computation methods,number of clusters, etc. but I hope this points you in the right direction.