I am trying to programmatically remove near-duplicate records from a data set. My data set is logically like the table below. As you can see, there are two rows, and a human can easily tell that they are related and were probably added by the same person.
My current approach is to use the Levenshtein ratio to compare the fields (name, address, telephone number) separately and then average the similarity ratios, which gives 0.77873. That similarity seems too low. My Python code looks like this:
from Levenshtein import ratio
name = ratio("Game of ThOnes Books for selling","Selling Game of Thrones books")
address = ratio("George Washington street","George Washington st.")
phone = ratio("555-55-55","0(555)-55-55")
total_ratio = name + address + phone
print(total_ratio / 3)  # average ratio
My question is: what is the best way to compare row data? Which algorithms or methodologies are suitable for this?
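For reference, the same per-field comparison can be written with only the standard library's difflib, avoiding the external dependency; SequenceMatcher.ratio gives similar (though not identical) numbers to Levenshtein's ratio:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity ratio in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a, b).ratio()

name = similarity("Game of ThOnes Books for selling", "Selling Game of Thrones books")
address = similarity("George Washington street", "George Washington st.")
phone = similarity("555-55-55", "0(555)-55-55")

print((name + address + phone) / 3)  # average ratio over the three fields
```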
We can compute a distance matrix between rows, form clusters, and select cluster members as candidates for similar rows.
In R, the stringdistmatrix function from the stringdist package computes distances between string inputs.
The distance methods supported by stringdist are listed below; see the package manual for more details:
#Method name ; Description
#osa     ; Optimal string alignment (restricted Damerau-Levenshtein distance)
#lv      ; Levenshtein distance (as in R's native adist)
#dl      ; Full Damerau-Levenshtein distance
#hamming ; Hamming distance (a and b must have the same number of characters)
#lcs     ; Longest common substring distance
#qgram   ; q-gram distance
#cosine  ; Cosine distance between q-gram profiles
#jaccard ; Jaccard distance between q-gram profiles
#jw      ; Jaro or Jaro-Winkler distance
#soundex ; Distance based on soundex encoding
Data:
library("stringdist")
#have modified the data slightly to include dissimilar datapoints
Date = c("07-Jan-17","06-Feb-17","03-Mar-17")
name = c("Game of ThOnes Books for selling","Selling Game of Thrones books","Harry Potter BlueRay")
address = c("George Washington street","George Washington st.","Central Avenue")
phone = c("555-55-55","0(555)-55-55","111-222-333")
DF = data.frame(Date,name,address,phone,stringsAsFactors=FALSE)
DF
# Date name address phone
#1 07-Jan-17 Game of ThOnes Books for selling George Washington street 555-55-55
#2 06-Feb-17 Selling Game of Thrones books George Washington st. 0(555)-55-55
#3 03-Mar-17 Harry Potter BlueRay Central Avenue 111-222-333
Hierarchical Clustering:
rowLabels = sapply(DF[,"name"],function(x) paste0(head(unlist(strsplit(x," ")),2),collapse="_" ) )
#create string distance matrix, hierarchical cluster object and corresponding plot
nameDist = stringdistmatrix(DF[,"name"])
nameHC = hclust(nameDist)
plot(nameHC,labels = rowLabels ,main="HC plot : name")
addressDist = stringdistmatrix(DF[,"address"])
addressDistHC = hclust(addressDist)
plot(addressDistHC ,labels = rowLabels, main="HC plot : address")
phoneDist = stringdistmatrix(DF[,"phone"])
phoneHC = hclust(phoneDist)
plot(phoneHC ,labels = rowLabels, main="HC plot : phone" )
Similar Rows:
The rows consistently form two clusters in this dataset; to identify the members of each cluster we can do:
clusterDF = data.frame(sapply(DF[,-1],function(x) cutree(hclust(stringdistmatrix(x)),2) ))
clusterDF$rowSummary = rowSums(clusterDF)
clusterDF
# name address phone rowSummary
#1 1 1 1 3
#2 1 1 1 3
#3 2 2 2 6
#row frequency
rowFreq = table(clusterDF$rowSummary)
#3 6
#2 1
#we keep the rowSummary values that occur more than once
similarRowValues = as.numeric(names(which(rowFreq>1)))
DF[clusterDF$rowSummary == similarRowValues,]
# Date name address phone
#1 07-Jan-17 Game of ThOnes Books for selling George Washington street 555-55-55
#2 06-Feb-17 Selling Game of Thrones books George Washington st. 0(555)-55-55
This demo works well on the simple toy dataset, but on a real dataset you would have to tinker with the string distance methods, the number of clusters, etc. I hope this points you in the right direction.
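If you want to stay in Python, the same idea (compare rows column by column, flag pairs that are close in every column) can be sketched with only the standard library. The 0.5 similarity threshold, the lowercasing, and the choice to skip the Date column are assumptions you would tune, not part of the R approach above:

```python
from difflib import SequenceMatcher
from itertools import combinations

rows = [
    ("07-Jan-17", "Game of ThOnes Books for selling", "George Washington street", "555-55-55"),
    ("06-Feb-17", "Selling Game of Thrones books", "George Washington st.", "0(555)-55-55"),
    ("03-Mar-17", "Harry Potter BlueRay", "Central Avenue", "111-222-333"),
]

def similar(a, b, threshold=0.5):
    # Case-insensitive similarity; the threshold is a tunable assumption.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def candidate_pairs(rows, columns=(1, 2, 3)):
    # A pair of rows is a duplicate candidate only if every compared column
    # (name, address, phone; the Date column is skipped) is similar.
    pairs = []
    for i, j in combinations(range(len(rows)), 2):
        if all(similar(rows[i][c], rows[j][c]) for c in columns):
            pairs.append((i, j))
    return pairs

print(candidate_pairs(rows))  # rows 0 and 1 pair up; row 2 is left alone
```

Requiring all columns to agree keeps false positives down; for fuzzier matching you could instead average the per-column ratios and threshold the average, which mirrors the question's approach.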