Tags: r, string, dataframe, str-replace

How to replace many differently written strings with one unified spelling of each term?


I have this data frame:

df = data.frame(x = c('Orange','orange','Appples','orgne','apple','applees','oranges','Oranges',
                      'orgens','orgaanes','Apples','ORANGES','apple','APPLE') )

Using str_replace_all, I know I can replace each of these variants with one unified spelling of the two words, orange and apple, but that would take forever when the data frame contains many terms. I would like a simple way to unify all the variants into orange and apple.


Solution

  • You can use agrep for approximate string matching:

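    # Overwrite every entry that approximately matches one of the target terms
    for (i in c("orange", "apple")){
      df$x[agrep(i, df$x, max.distance = 2, ignore.case = TRUE)] <- i
    }
    df$x  # print once after the loop; a bare expression inside for() is not auto-printed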
    
    #[1] "orange"   "orange"   "apple"    "orange"   "apple"   "apple"    "orange"   "orange"   "orange"   "orgaanes" "apple"    "orange"   "apple"    "apple"  
    

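    You can adjust how tolerant the matching is with max.distance: a larger value accepts spellings that are further from the target. In the output above, "orgaanes" is left unchanged because it does not match either term within max.distance = 2.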
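
    To get a feel for max.distance, here is a small sketch that runs agrep for "orange" at several tolerances and collects the values it accepts. The names x, tolerances and hits_by_dist are made up for this illustration, and the range 0:3 is an arbitrary choice:

    ```r
    x <- c('Orange','orange','Appples','orgne','apple','applees','oranges','Oranges',
           'orgens','orgaanes','Apples','ORANGES','apple','APPLE')

    # Tolerances to compare (arbitrary choice for this sketch)
    tolerances <- 0:3

    # For each tolerance, keep the entries that agrep() accepts as "orange"
    hits_by_dist <- lapply(tolerances, function(d) {
      agrep("orange", x, max.distance = d, ignore.case = TRUE, value = TRUE)
    })
    names(hits_by_dist) <- tolerances

    hits_by_dist
    ```

    Every entry accepted at a smaller tolerance is also accepted at a larger one, so you can pick the smallest value that catches your misspellings without also pulling in unrelated words.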


    Another possibility is the stringdist package, which has a number of different distance metrics:

    library(stringdist)
    v <- c("orange", "apple")  # the unified spellings
    # look up the closest term in v for each lower-cased entry
    v[amatch(tolower(df$x), v, maxDist = 3)]
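
    If you want to keep the original values next to the unified ones, something like the following might work. It is only a sketch: the names canonical, idx, unified and unmatched are made up here, and method = "osa" just spells out what I understand to be the default metric of amatch. Note that amatch returns NA for any entry that has no term within maxDist, so it is worth checking which rows were left unmatched:

    ```r
    library(stringdist)

    df <- data.frame(x = c('Orange','orange','Appples','orgne','apple','applees','oranges','Oranges',
                           'orgens','orgaanes','Apples','ORANGES','apple','APPLE'),
                     stringsAsFactors = FALSE)

    # Target spellings (hypothetical name for this sketch)
    canonical <- c("orange", "apple")

    # Index of the closest target term per entry, or NA if none is within maxDist
    idx <- amatch(tolower(df$x), canonical, method = "osa", maxDist = 3)

    # Store the result in a new column so the raw values stay available
    df$unified <- canonical[idx]

    # Entries that could not be assigned to any target term
    unmatched <- df$x[is.na(df$unified)]
    unmatched
    ```

    Other metrics can be selected through the method argument. Keep in mind that some of them, such as "jw", are scaled between 0 and 1, so maxDist would have to be a fraction rather than a count of edits.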