list1 <- c("prmum","prum","primium","prm","prim","primum","prem","premum",
"wrng","wng",
"hug","hung",
"amut",
"chq","chquked","cheuq","chek","cheq",
"cus","cust",
"cbk","cb",
"ringirng","rining","rigirigi")
list2 <- c("premium","wrong","hang","amount","cheque","customer","callback","ringing")
dat <- as.data.frame(list1)
for (i in length(list1)) {
  t <- agrep(list1[i], list2, value = FALSE)
  dat[t] <- list2[i]
}
I have two lists, one containing misspelled words (list1) and the other containing the correct words (list2). I am trying to do the following:
1) Take the first wrong word.
2) Do an approximate match against the list of correct words and get the index of the match.
3) Replace the wrong word with the correct word at that location in the data frame or list.
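You can do it using stringdistmatrix from the stringdist package. By default it computes the optimal string alignment distance, a close variant of the Levenshtein distance that agrep is based on (pass method = "lv" for plain Levenshtein). For each word in list1, you find the word in list2 with the smallest distance and use it as the replacement.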
library(stringdist)
dist_mat <- stringdistmatrix(list1, list2)
clean_list1 <- list2[apply(dist_mat, 1, which.min)]
clean_list1
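Since the question asks for the result in a data frame, here is a minimal sketch of one way to keep each wrong word next to its replacement and the distance between them. The shortened word lists and the column names (wrong, corrected, distance) are only illustrative assumptions.

```r
library(stringdist)

# Shortened sample of the lists from the question (illustrative only)
list1 <- c("prmum", "wrng", "hung", "amut", "cheq", "cust", "cbk", "rining")
list2 <- c("premium", "wrong", "hang", "amount", "cheque", "customer",
           "callback", "ringing")

# Rows correspond to list1, columns to list2
dist_mat <- stringdistmatrix(list1, list2)

# Column index of the closest correct word for every wrong word
best <- apply(dist_mat, 1, which.min)

dat <- data.frame(
  wrong     = list1,
  corrected = list2[best],
  # pick the (row, best column) entry of the matrix for each row
  distance  = dist_mat[cbind(seq_along(list1), best)],
  stringsAsFactors = FALSE
)
dat
```

Keeping the distance column makes weak matches easier to spot: an abbreviation that drops many letters can end up closer to an unrelated short word than to the intended one.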
This solution may be inappropriate if you have very long lists: for lists of length l1 and l2, the distance matrix has l1*l2 entries. In that case you can loop over list1 one word at a time, so that only one vector of l2 distances is held in memory at once.
clean_list1 <- list1
for (i in seq_along(list1)) {
  # distances from the i-th wrong word to every correct word
  dist_vect <- stringdistmatrix(list1[i], list2)
  clean_list1[i] <- list2[which.min(dist_vect)]
}
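As a possible alternative to the explicit loop, the same package provides amatch, which returns the index of the closest entry in a lookup table without building the full matrix. The sketch below assumes a cutoff of maxDist = 3 and a made-up non-matching word ("zzzzzzzz"); both are illustrative choices, not values from the question.

```r
library(stringdist)

# Small illustrative sample; "zzzzzzzz" is a made-up word with no close match
list1 <- c("prmum", "wrng", "amut", "zzzzzzzz")
list2 <- c("premium", "wrong", "hang", "amount")

# Index of the closest word in list2, or NA if nothing is within maxDist edits
idx <- amatch(list1, list2, method = "osa", maxDist = 3)

# Keep the original word when no acceptable match was found
clean_list1 <- ifelse(is.na(idx), list1, list2[idx])
clean_list1
```

The cutoff is a trade-off: a small maxDist leaves heavily abbreviated words uncorrected, while a large one forces every word onto some entry of list2 even when none is a sensible match.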