I'm trying to use stringdist to find, for each string in a vector, another string in the same vector that is within a maximum distance of 1, and then report the match. Here is a sample of the data:
Starting data frame:
a = c("tom", "tomm", "alex", "alexi", "chris", "jen", "jenn", "michell")
b = c(NA)
df = data.frame(a,b)
Desired results:
a = c("tom", "tomm", "alex", "alexi", "chris", "jen", "jenn", "michell")
b = c("tomm", "tom", "alexi", "alex", 0, "jenn", "jen", 0)
df = data.frame(a,b)
I can use stringdist for two vectors, but am having trouble using it for one vector. Thanks for your help, R community.
Here's one possible approach:
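library(stringdist)

a = c("tom", "tomm", "alex", "alexi", "chris", "jen", "jenn", "michell")

# For each element, return the closest other element, or "0" if none is within tol
min_dist <- function(x, method = "cosine", tol = .5){
  y <- vector(mode = "character", length = length(x))
  for(i in seq_along(x)){
    # distances from x[i] to every other element of x
    dis <- stringdist(x[i], x[-i], method = method)
    if (min(dis) > tol) {
      y[i] <- "0"
    } else {
      y[i] <- x[-i][which.min(dis)]
    }
  }
  y
}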
min_dist(a, 'cosine', .4)
## [1] "tomm" "tom" "alexi" "alex" "0" "jenn" "jen" "0"
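Since the question asks for a maximum distance of 1, an edit-distance variant may map onto it more directly than a cosine tolerance. The following is only a sketch under that assumption: `nearest_within` and `max_dist` are hypothetical names, Levenshtein distance (`method = "lv"`) is my reading of "distance of 1", and the whole distance matrix is computed at once with `stringdistmatrix` instead of looping.

```r
library(stringdist)

a <- c("tom", "tomm", "alex", "alexi", "chris", "jen", "jenn", "michell")

# Hypothetical helper: for each element, the nearest other element within
# max_dist edits, or "0" when nothing is that close.
nearest_within <- function(x, max_dist = 1) {
  d <- stringdistmatrix(x, x, method = "lv")  # pairwise Levenshtein distances
  diag(d) <- Inf                              # a string must not match itself
  nearest <- apply(d, 1, which.min)           # index of the closest other string
  best <- d[cbind(seq_along(x), nearest)]     # distance to that string
  ifelse(best <= max_dist, x[nearest], "0")
}

df <- data.frame(a = a, b = nearest_within(a), stringsAsFactors = FALSE)
df$b
## [1] "tomm"  "tom"   "alexi" "alex"  "0"     "jenn"  "jen"   "0"
```

If several strings are tied for nearest, `which.min` keeps only the first one, so this returns a single match per element, as the loop version does.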