Search code examples
rduplicatesrecord-linkage

Deduplicate a table based on values in 2 columns + fuzzy matching


I have a CSV file exported from Zotero with metadata of my library entries. I know it contains quite a few duplicates, but it's not that easy to get rid of them:

  • Not all items with similar titles are actually duplicates, e.g.

    | Year |            Author             |    Title     |
    +------+-------------------------------+--------------+
    | 2016 | Jones, Erik                   | Book Reviews |
    | 2016 | Hassner, Pierre; Jones, Erik  | Book Reviews |
    | 2010 | Adams, Laura L.; Gagnon, Chip | Book Reviews |
    
  • Not all items that are actually similar have 100% identical metadata strings, e.g.

    |    Author     |                     Title                     |
    +---------------+-----------------------------------------------+
    | Tichý, Lukáš; | Can Iran Reduce EU Dependence on Russian Gas? |
    | Tichy, L.;    | "can iran reduce eu dependence onrussian gas" |
    

That is an extreme example (differences are not that big usually), but as you can see, pre-cleaning won't exactly solve this; so the idea is to eliminate rows that hold similar values in two+ columns - say, "Author" and "Title".

What I've tried/looked through so far:

  • OpenRefine - barely familiar with it, so could not come up with or find anything viable.
  • Excel fuzzy lookup extension - does not really work the way I need.
  • Python - again, I'm not good with the language; and I couldn't find any relevant solutions/guides.
  • R: tried out a few ideas:

First use agrep in for loop on "Author" column to get indexes of rows with duplicates; then do the same for "Title" column; and then compare the vectors and dedupe the rows where values coincide. Needless to say, I couldn't move beyond step 1:

titles <- unlist(corpus$"Title")
for (i in 1:length(titles)){
  Title_dupe_temp <- agrep(titles[i], titles[i+1:length(titles)], 
                           max.distance = 1, ignore.case = TRUE, fixed = FALSE)
  Title_dupes[i] <- paste(i, Title_dupe_temp, sep = " ")
}

The result is (almost) complete gibberish; plus I get warning messages:

In Title_dupes[i] <- paste(i, Title_dupe_temp, sep = " ") :
  number of items to replace is not a multiple of replacement length

I've also read through fuzzywuzzyR documentation, but did not find any functions that could help.

Finally, I've tried RecordLinkage package. Still, I could not go past the basics. The documentation is rather heavy and not explicit on all things; guides are scarce, and the ones I've found (e.g. this) use the example datasets which have identity vectors ready - and so I could not figure it out how to replicate that on my data.

So at this point I don't care whether to do it in OpenRefine/R/Py/SQL/whatever, just to do it in any way.


Solution

  • Solution I: using a loop and the library stringdist:

    library(stringdist)
        zotero<-data.frame(
          Year=c(2016,2016,2010,2010,2010,2010),
          Author=c("Jones, Erik","Hassner, Pierre;","Adams, Laura L.;","Tichý, Lukáš;","Tichý, Lukáš;","Tichy, L.;"),
          Title=c("Book Reviews","Book Reviews","Book Reviews","Can Iran Reduce EU Dependence on Russian Gas?","Can Iran Reduce EU Dependence on Russian Gas?","can iran reduce eu dependence onrussian gas")
        )
    
        zotero$onestring<-paste0(zotero$Year,zotero$Author,zotero$Title)
        zotero<-zotero[order(zotero[,1],zotero[,2]),]
    
        atot<-NULL
        for (i in 2:dim(zotero)[1]){
          a<-stringdist(zotero$onestring[i-1],zotero$onestring[i])/(nchar(zotero$onestring[i-1])+nchar(zotero$onestring[i]))
          atot<-rbind(atot,a)
        }
    
        zotero<-cbind(zotero,threshold=c(1,atot))
        zotero[zotero$threshold>0.15,]
    

    Solution II: it may be faster to compute that with a matrix than with a loop: First I create a dataframe based on your data sample, second I remove non-UTF characters, third I use the library stringdist to compute a distance matrix. You can easily convert these to percentages of similarity.

    zotero<-data.frame(
      Year=c(2016,2016,2010,2010,2010,2010),
      Author=c("Jones, Erik","Hassner, Pierre;","Adams, Laura L.;","Tichý, Lukáš;","Tichý, Lukáš;","Tichy, L.;"),
      Title=c("Book Reviews","Book Reviews","Book Reviews","Can Iran Reduce EU Dependence on Russian Gas?","Can Iran Reduce EU Dependence on Russian Gas?","can iran reduce eu dependence onrussian gas")
    )
    
    zotero$onestring<-paste0(zotero$Year,zotero$Author,zotero$Title)
    
    Encoding(zotero$onestring) <- "UTF-8"
    zotero$onestring<-iconv(zotero$onestring, "UTF-8", "UTF-8",sub='')
    
    library(stringdist)
    stringdistmatrix(zotero$onestring)
    

    Result:

    > stringdistmatrix(zotero$onestring)
       1  2  3  4  5
    2 11            
    3 13 14         
    4 47 45 44      
    5 47 45 44  0   
    6 47 45 42 13 13