I have this data frame containing near-duplicate strings (the same names with small spelling differences):
place1 <- c("pondichery ", "Pondichery", "Pondichéry", "Port-Louis", "Port Louis ")
place2 <- c("Lorent", "Pondichery", " Lorient", "port-louis", "Port Louis")
place3 <- c("Loirent", "Pondchéry", "Brest", "Port Louis", "Nantes")
places2clean <- data.frame(place1, place2, place3)
Here is my custom dictionary:
dictionnary <- c("Pondichéry", "Lorient", "Port-Louis", "Nantes", "Brest")
dictionnary <- data.frame(dictionnary)
I want to match and replace all strings based on this custom dictionary.
Expected result:
place1 place2 place3
Pondichéry Lorient Lorient
Pondichéry Pondichéry Pondichéry
Pondichéry Lorient Brest
Port-Louis Port-Louis Port-Louis
Port-Louis Port-Louis Nantes
How can I use string distance to match and replace across the whole data frame?
The following first computes, for each column, a matrix of distances between its strings and the dictionary entries, and then picks for each string the dictionary entry with the smallest distance.
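library(stringdist)
# Strip leading/trailing whitespace so it does not count towards the distance
places2clean[] <- lapply(places2clean, trimws)
# For each column: one row per string, one column per dictionary entry
d <- lapply(places2clean, function(x) {
  sapply(dictionnary$dictionnary, function(y) stringdist(x, y))
})
# For each row, find the smallest distance and look up that dictionary entry
res <- sapply(d, function(x){
  inx <- apply(x, 1, which.min)
  dictionnary$dictionnary[inx]
})
as.data.frame(res)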
# place1 place2 place3
#1 Pondichéry Lorient Lorient
#2 Pondichéry Pondichéry Pondichéry
#3 Pondichéry Lorient Brest
#4 Port-Louis Port-Louis Port-Louis
#5 Port-Louis Port-Louis Nantes
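One caveat with taking the minimum distance is that every string gets mapped to some dictionary entry, even one that is nothing like any of them. As a possible variant, here is a minimal sketch using `amatch()` from the same stringdist package, which returns the index of the closest entry or `NA` when nothing is within `maxDist`. The names `dict`, `max_dist` and `cleaned`, and the threshold value of 3, are my own assumptions for illustration and not part of the question.

```r
library(stringdist)

# Same data as in the question.
place1 <- c("pondichery ", "Pondichery", "Pondichéry", "Port-Louis", "Port Louis ")
place2 <- c("Lorent", "Pondichery", " Lorient", "port-louis", "Port Louis")
place3 <- c("Loirent", "Pondchéry", "Brest", "Port Louis", "Nantes")
places2clean <- data.frame(place1, place2, place3, stringsAsFactors = FALSE)

# Dictionary as a plain character vector (hypothetical name `dict`).
dict <- c("Pondichéry", "Lorient", "Port-Louis", "Nantes", "Brest")

# Hypothetical threshold: a string further than this from every dictionary
# entry becomes NA instead of being forced onto a poor match.
max_dist <- 3

cleaned <- places2clean
cleaned[] <- lapply(places2clean, function(col) {
  # amatch() gives, for each string, the index of the closest dictionary
  # entry within maxDist (NA if there is none).
  idx <- amatch(trimws(col), dict, method = "osa", maxDist = max_dist)
  dict[idx]
})
cleaned
```

On the sample data this should give the same table as above, since every string is within two edits of its intended entry. With real data you would want to tune `max_dist` and inspect the rows that come back as `NA`.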