Search code examples
rfuzzy-searchdata-cleaning

Replace duplicate values from dataframe column using fuzzy match


I am trying to use library('RecordLinkage') and the compare.dedup() function to replace duplicate values in a single column.

Similar to this post, I have one vector

tv3 = c("TOURDEFRANCE", 'TOURDEFRANCE', "TOURDE FRANCE", "TOURDE FRANZ", "GET FRESH") 

and the output I want is the following, based on a set value of weight (e.g. > 0.8):

("TOURDEFRANCE", 'TOURDEFRANCE', "TOURDEFRANCE", "TOURDEFRANCE", "GET FRESH") 

Here's the code I tried to get the matches dataframe:

tv3 = as.data.frame(c("TOURDEFRANCE", 'TOURDEFRANCE', "TOURDE FRANCE", 
                  "TOURDE FRANZ", "GET FRESH"))
colnames(tv3) <- "name"
tv3 %>% compare.dedup(strcmp = TRUE) %>%
    epiWeights() %>%
    epiClassify(0.8) %>%
    getPairs(show = "links", single.rows = TRUE) -> matches

However to get what I need, I used the below loop:

matches <- matches[order(matches$id1),] 
tv3new <- tv3
for (i in 1:nrow(matches)) {
  tv3new[tv3new$name==matches[i,'name.2'],] <- matches[i,'name.1']
} 
tv3new

This gives me what I want, but wondering if using a loop is the best way to do it, or if I am missing something obvious.


Solution

  • Without a loop:

    tv3new <- c(as.character(matches[tv3$name %in% matches$name.2*1, 2]), 
              as.character(tv3[!tv3$name %in% matches$name.2, ]))
    # If we need a data frame
    data.frame(name = tv3new)
    

    Output:

              name
    1 TOURDEFRANCE
    2 TOURDEFRANCE
    3 TOURDEFRANCE
    4 TOURDEFRANCE
    5    GET FRESH