
How to do an approx match and replace with correct word in R?


list1 <- c("prmum","prum","primium","prm","prim","primum","prem","premum",
           "wrng","wng",
           "hug","hung",
           "amut",
           "chq","chquked","cheuq","chek","cheq",
           "cus","cust",
           "cbk","cb",
           "ringirng","rining","rigirigi")


list2 <- c("premium","wrong","hang","amount","cheque","customer","callback","ringing")
dat <- as.data.frame(list1)
for(i in length(list1)){
  t <- agrep(list1[i], list2, value = FALSE)
  dat[t] <- list2[i]
}

I have two lists, one of wrong words (list1) and one of correct words (list2). I am trying to do the following:

1) Take the first wrong word.
2) Do an approximate match against the list of correct words and get the index of the match.
3) Replace the wrong word with the correct word at that position in the data frame or list.


Solution

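  • You can do it using stringdistmatrix from the stringdist package. By default it uses the optimal string alignment distance, a Levenshtein variant that also counts swaps of adjacent characters (pass method = "lv" for plain Levenshtein, which is what agrep is based on). For each word in list1 you find the closest word in list2 and use it as the replacement.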

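    library(stringdist)
    # Rows correspond to the words in list1, columns to the candidates in list2
    dist_mat <- stringdistmatrix(list1, list2)
    # For each row, take the candidate with the smallest distance (the first one on ties)
    clean_list1 <- list2[apply(dist_mat, 1, which.min)]
    clean_list1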
    

    This solution may be inappropriate for very long lists: if they have lengths l1 and l2, the distance matrix has l1*l2 entries. In that case you can loop over list1 one word at a time to reduce memory consumption.

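    clean_list1 <- list1
    for (i in seq_along(list1)){
         # Distances from one word to every candidate (a 1 x length(list2) matrix)
         dist_vect <- stringdistmatrix(list1[i], list2)
         # Overwrite only the i-th element with its closest candidate
         clean_list1[i] <- list2[which.min(dist_vect)]
    }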
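
Note that both versions above always replace a word with its nearest candidate, even when that candidate is far away (for example "cb" is not close to anything in list2). Here is a minimal sketch of one way to guard against that, using adist from base R (generalized Levenshtein distance) so that no extra package is needed. The function name closest_word, its max_dist argument, and the cutoff of 2 are my own assumptions for illustration, not part of the question or of any package, and the sketch assumes there are no NA values in the input.

```r
list2 <- c("premium", "wrong", "hang", "amount", "cheque",
           "customer", "callback", "ringing")

# Hypothetical helper: replace each word by its closest dictionary entry,
# but only if the edit distance is at most max_dist (otherwise keep the word).
closest_word <- function(words, dictionary, max_dist = Inf) {
  # Matrix of edit distances: one row per word, one column per dictionary entry
  d <- utils::adist(words, dictionary)
  # Column index of the smallest distance in each row (first one on ties)
  best <- apply(d, 1, which.min)
  # The smallest distance itself, picked out with (row, column) index pairs
  best_dist <- d[cbind(seq_along(words), best)]
  ifelse(best_dist <= max_dist, dictionary[best], words)
}

closest_word(c("primium", "wrng", "cheq", "cb"), list2, max_dist = 2)
# [1] "premium" "wrong"   "cheque"  "cb"
```

With max_dist = Inf (the default in this sketch) the behaviour matches the answer above: every word is replaced by its nearest candidate. A sensible cutoff depends on how heavily abbreviated your words are, so treat the value 2 as a starting point only.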