r stringdist

stringdist on one vector


I'm trying to use stringdist to find, for each string in a vector, any other string in the same vector within a maximum distance of 1, and then return that match. Here is a sample of the data:

Starting data frame:

a = c("tom", "tomm", "alex", "alexi", "chris", "jen", "jenn", "michell") 
b = c(NA) 
df = data.frame(a,b) 

Desired results:

a = c("tom", "tomm", "alex", "alexi", "chris", "jen", "jenn", "michell") 
b = c("tomm", "tom", "alexi", "alex", 0, "jenn", "jen", 0) 
df = data.frame(a,b) 

I can use stringdist to compare two vectors, but am having trouble using it within a single vector. Thanks for your help, R community.


Solution

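  • Here's one possible approach. It compares each string with every other element of the vector and keeps the closest one, provided its distance is within a tolerance (here the cosine distance between q-gram profiles rather than an edit distance of 1):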

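    library(stringdist)
    
    a = c("tom", "tomm", "alex", "alexi", "chris", "jen", "jenn", "michell") 
    
    # For each element of x, return the closest other element,
    # or "0" when nothing is within the tolerance `tol`.
    min_dist <- function(x, method = "cosine", tol = .5){
        y <- vector(mode = "character", length = length(x))
        for(i in seq_along(x)){
            # distances from x[i] to every other element
            dis <- stringdist(x[i], x[-i], method = method)
            if (min(dis) > tol) {
                y[i] <- "0"
            } else {
                y[i] <- x[-i][which.min(dis)]
            }
        }
        y
    }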
    
    min_dist(a, 'cosine', .4)
    
    ## [1] "tomm"  "tom"   "alexi" "alex"  "0"     "jenn"  "jen"   "0"
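
    The question asks for a maximum distance of 1, which reads most naturally as an edit distance. A minimal sketch of that variant follows, assuming Levenshtein distance (`method = "lv"`) is what is wanted. The helper name `nearest_within` and its `max_dist` argument are made up for illustration and are not part of stringdist. It builds the full pairwise matrix with `stringdistmatrix` instead of looping:

```r
library(stringdist)

a <- c("tom", "tomm", "alex", "alexi", "chris", "jen", "jenn", "michell")

# Hypothetical helper (not part of stringdist): for each element of x,
# return the nearest other element within max_dist, or "0" if there is none.
nearest_within <- function(x, max_dist = 1, method = "lv") {
    d <- stringdistmatrix(x, x, method = method)  # full pairwise distance matrix
    diag(d) <- Inf                                # never match a string to itself
    nearest <- apply(d, 1, which.min)             # column index of the closest other string
    ifelse(d[cbind(seq_along(x), nearest)] <= max_dist, x[nearest], "0")
}

df <- data.frame(a = a, b = nearest_within(a), stringsAsFactors = FALSE)
df$b
## [1] "tomm"  "tom"   "alexi" "alex"  "0"     "jenn"  "jen"   "0"
```

    The pairwise matrix grows with the square of the vector length, so for long vectors the loop in the answer above may be easier on memory.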