Tags: regex, r, gsub, levenshtein-distance, stringdist

R: using stringdist or levenshtein.distance to replace strings


I have a large dataset with about one million observations, each keyed with a defined observation type. About 900,000 of those observations have malformed observation types: there are ~850 (incorrect) variations of the 50 acceptable observation types.

keys <- c("DAY", "EVENING","SUNSET", "DUSK","NIGHT", "MIDNIGHT", "TWILIGHT", "DAWN","SUNRISE", "MORNING")

entries <- c("Day", "day", "SUNSET/DUSK", "DAYS", "dayy", "EVEN", "Evening", "early dusk", "late day", "nite", "red dawn", "Evening Sunset", "mid-night", "midnight", "midnite","DAY", "EVENING","SUNSET", "DUSK","NIGHT", "MIDNIGHT", "TWILIGHT", "DAWN","SUNRISE", "MORNING")

Using gsub is akin to digging a basement with a hand shovel, and in my case a broken-handled shovel, as I'm very new to R and the intricacies of regular expressions. The simple fallback (for me) is to write one gsub statement for each accepted observation type, but that seems unnecessarily arduous as it needs 50 statements.

I'd like to use levenshtein.distance or stringdist to replace each offending entry with the accepted type at the shortest distance. Running z <- for (i in length(y)) { z[i] = levenshtein.distance(y[i], x)} doesn't work, as it returns length(x) results for each y[i].

How do I return the result with the minimum distance? I've seen function(x) x[2], which returns the 2nd result in a series, but how do I get the lowest?


Solution

  • You could try:

    library(stringdist)
    m <- stringdistmatrix(entries, keys, method = "lv")
    a <- keys[apply(m, 1, which.min)]
    

    If you want to experiment with different algorithms, have a look at ?'stringdist-metrics'.


    Or, as mentioned by @RHertel in the comments:

    b <- keys[apply(adist(entries, keys), 1, which.min)]
    

    From the adist() documentation:

    Compute the approximate string distance between character vectors. The distance is a generalized Levenshtein (edit) distance, giving the minimal possibly weighted number of insertions, deletions and substitutions needed to transform one string into another.

    The two methods yield identical results:

    > identical(a, b)
    #[1] TRUE
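
Two practical caveats with the nearest-key approach above: which.min always returns some key (the first one on a tie), even when an entry is far from every key, and with about a million rows but only ~850 distinct variants it is cheaper to match the unique values once and map the result back. The following is a minimal sketch of both ideas using only base R. The helper name closest_key and the cutoff max_dist = 2 are assumptions made for illustration, not part of stringdist or base R, and the sketch assumes the entries contain no NA values.

```r
# Base R only: adist() ships with the utils package.
keys <- c("DAY", "EVENING", "SUNSET", "DUSK", "NIGHT",
          "MIDNIGHT", "TWILIGHT", "DAWN", "SUNRISE", "MORNING")
entries <- c("Day", "dayy", "mid-night", "Evening", "Day", "zzzzzzzz")

# Hypothetical helper (name and default cutoff are illustrative assumptions).
closest_key <- function(x, keys, max_dist = 2) {
  # Upper-case first so that case differences do not count as edits.
  d <- adist(toupper(x), keys)
  idx <- apply(d, 1, which.min)          # column of the nearest key per entry
  best <- d[cbind(seq_along(x), idx)]    # distance to that nearest key
  out <- keys[idx]
  out[best > max_dist] <- NA_character_  # too far from every key: leave for review
  out
}

# Match each distinct spelling once, then map the result back to all rows.
u <- unique(entries)
lookup <- closest_key(u, keys)
cleaned <- lookup[match(entries, u)]
```

Entries that come back as NA were not within max_dist edits of any key and probably need a manual rule. The cutoff is a judgment call: too low leaves more entries unmatched, too high lets unrelated strings snap to whichever key happens to be nearest.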