I have receipt data with item descriptions, but some of them are pretty similar, and I'd like to code the similar ones with the same value to increase the chances of finding associations in the data. For example:
Strawberries
Premium Strawberries
Premium Strawberries
Hass Avocado
Mini Avocado
I'd like to have:
Strawberries
Strawberries
Strawberries
Avocado
Avocado
Something to that effect, but I'm open to suggestions for sure. All I can think of is that some sort of fuzzy search might be what I need; I just don't know how to implement that.
Thanks, once again!
One possible approach is string distances. Be careful, though: they do not capture any meaning, only the similarity between the actual strings. The example below can serve as a heuristic, but pay attention to the last item, "Not Avocado", which still ends up grouped with "Avocado". The higher the threshold in cutree, the fewer groups you will have, and probably the more wrongly classified items. Conversely, a lower threshold is stricter and may miss good matches:
th <- 0.35  # cut height, between 0 and 1
roles <- c("Strawberies", "strawberries", "Mini strawberries", "Avocado", "Hass avocado", "Not Avocado")
# Pairwise Jaro-Winkler distances; p is the prefix factor
mat <- stringdist::stringdistmatrix(roles, roles, method = "jw", p = 0.025, nthread = parallel::detectCores())
dimnames(mat) <- list(roles, roles)
# Single-linkage hierarchical clustering, cut at height th
tree <- hclust(as.dist(mat), method = "single")  # named tree so it does not mask base::t
memb <- cutree(tree, h = th)
# Label every string with the first member of its cluster
prior <- data.frame(str = roles, to = roles[match(memb, memb)], stringsAsFactors = FALSE)
prior
                str          to
1       Strawberies Strawberies
2      strawberries Strawberies
3 Mini strawberries Strawberies
4           Avocado     Avocado
5      Hass avocado     Avocado
6       Not Avocado     Avocado
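To get a feel for how the threshold trades off the number of groups, it can help to sweep a few cut heights and count the resulting clusters. The following is only a rough sketch under some assumptions: it uses the items from the question as sample data, lower-cases them first, and swaps in base R's `utils::adist` (Levenshtein distance, scaled by the longer string's length) instead of Jaro-Winkler so that it runs without extra packages. The names `items`, `thresholds` and `n_groups` are made up for the illustration.

```r
# Sample items taken from the question (illustrative only)
items <- c("Strawberries", "Premium Strawberries", "Hass Avocado", "Mini Avocado")

# Levenshtein distance on lower-cased strings, scaled to [0, 1]
# by the length of the longer string in each pair
x <- tolower(items)
d <- utils::adist(x) / outer(nchar(x), nchar(x), pmax)

# Same clustering step as above: single linkage on the distance matrix
tree <- hclust(as.dist(d), method = "single")

# Count how many groups each candidate threshold produces
thresholds <- seq(0, 1, by = 0.1)
n_groups <- sapply(thresholds, function(h) length(unique(cutree(tree, h = h))))
data.frame(threshold = thresholds, groups = n_groups)
```

Reading the table from top to bottom, the group count can only stay the same or drop as the threshold grows, so a sensible value is usually one just below the point where unrelated items start to merge. Checking the actual groupings by eye at that value is still worthwhile.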