r, fuzzy-search, fuzzy-comparison

Fuzzy match and replace strings in a data frame using a custom dictionary


I have this data frame containing near-duplicate strings (the same place names with small spelling differences):

 place1 <- c("pondichery ", "Pondichery", "Pondichéry", "Port-Louis", "Port Louis  ")
 place2 <- c("Lorent", "Pondichery", " Lorient", "port-louis", "Port Louis")
 place3 <- c("Loirent", "Pondchéry", "Brest", "Port Louis", "Nantes")

 places2clean <- data.frame(place1, place2, place3)

Here is my custom dictionary:

  dictionnary <- c("Pondichéry", "Lorient", "Port-Louis", "Nantes", "Brest")

  dictionnary <- data.frame(dictionnary)

I want to match every string against the custom dictionary and replace it with the closest dictionary entry.

Expected result:

     place1     place2     place3
 Pondichéry    Lorient    Lorient
 Pondichéry Pondichéry Pondichéry
 Pondichéry    Lorient      Brest
 Port-Louis Port-Louis Port-Louis
 Port-Louis Port-Louis     Nantes

How can I use string distances to match and replace across the whole data frame?


Solution

  • The following first computes, for each column, a matrix of distances between its strings and the dictionary entries, then picks for each string the dictionary entry with the smallest distance.

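    library(stringdist)
    
    # Strip leading/trailing whitespace from every column
    places2clean[] <- lapply(places2clean, trimws)
    
    # One distance matrix per column: rows = strings, columns = dictionary entries
    d <- lapply(places2clean, function(x) {
      sapply(dictionnary$dictionnary, function(y) stringdist(x, y))
    })
    # For each string, keep the dictionary entry with the smallest distance
    res <- sapply(d, function(x){
      inx <- apply(x, 1, which.min)
      dictionnary$dictionnary[inx]
    })
    
    as.data.frame(res)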
    #      place1     place2     place3
    #1 Pondichéry    Lorient    Lorient
    #2 Pondichéry Pondichéry Pondichéry
    #3 Pondichéry    Lorient      Brest
    #4 Port-Louis Port-Louis Port-Louis
    #5 Port-Louis Port-Louis     Nantes
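
Note that `which.min` always returns some dictionary entry, even for a string that resembles none of them. If that matters, one option is to add a distance cutoff. The following is only a minimal sketch under a few assumptions: it swaps in base R's `adist` (generalized Levenshtein distance) so that no package is needed, and the helper name `fuzzy_replace`, the plain vector `dict`, and the cutoff `max_dist = 3` are hypothetical choices for illustration, not part of the original answer.

```r
# Hypothetical helper: replace each string with its closest dictionary entry,
# or NA when no entry is within max_dist edits (max_dist is an assumed cutoff).
fuzzy_replace <- function(x, dict, max_dist = 3) {
  x <- trimws(x)
  dists <- adist(x, dict)                        # rows = strings, columns = dictionary entries
  best <- apply(dists, 1, which.min)             # index of the closest entry per string
  best_dist <- dists[cbind(seq_along(x), best)]  # distance to that closest entry
  ifelse(best_dist <= max_dist, dict[best], NA_character_)
}

# Same data as in the question
place1 <- c("pondichery ", "Pondichery", "Pondichéry", "Port-Louis", "Port Louis  ")
place2 <- c("Lorent", "Pondichery", " Lorient", "port-louis", "Port Louis")
place3 <- c("Loirent", "Pondchéry", "Brest", "Port Louis", "Nantes")
places2clean <- data.frame(place1, place2, place3)

# Dictionary as a plain character vector (assumption: the data frame wrapper is not needed)
dict <- c("Pondichéry", "Lorient", "Port-Louis", "Nantes", "Brest")

# Apply the helper column by column, keeping the data frame shape
cleaned <- places2clean
cleaned[] <- lapply(places2clean, fuzzy_replace, dict = dict)
cleaned
```

With this cutoff, every string in the example still maps to a dictionary entry, while a string far from all entries comes back as `NA` and can be reviewed by hand. The right value for `max_dist` depends on the data, so treat 3 as a starting point rather than a recommendation.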