r, string, receipt

Group similar strings and change their values to something common while retaining the individual rows


I have receipt data with item descriptions, and some of the descriptions are very similar. I'd like to code the similar ones with the same value to increase the chances of finding associations in the data. For example:

Strawberries
Premium Strawberries
Premium Strawberries 
Hass Avocado
Mini Avocado

I'd like to have:

Strawberries
Strawberries
Strawberries
Avocado
Avocado

Something to that effect, but I'm open to suggestions. All I can think of is that some sort of fuzzy search might be what I need; I just don't know how to implement it.

Thanks, once again!


Solution

  • One possible approach is to use string distances. Be careful, though: they capture only how similar the character sequences are, not what the strings mean. The example below can serve as a heuristic, but note the last row of the output, where "Not Avocado" is grouped with "Avocado". The higher the threshold passed to cutree, the fewer groups you get, and probably the more strings end up in the wrong group. A lower threshold is stricter, so it may miss some valid matches:

    th <- 0.35  # cut height for cutree(), between 0 and 1
    roles <- c("Strawberies", "strawberries", "Mini strawberries",
               "Avocado", "Hass avocado", "Not Avocado")
    # Pairwise Jaro-Winkler distances (method = "jw"; p is the prefix factor)
    mat <- stringdist::stringdistmatrix(roles, roles, method = "jw", p = 0.025,
                                        nthread = parallel::detectCores())
    dimnames(mat) <- list(roles, roles)
    # Single-linkage clustering, then cut the tree at height th to get group ids
    hc <- hclust(as.dist(mat), method = "single")
    memb <- cutree(hc, h = th)
    df <- data.frame(a = roles, b = memb, stringsAsFactors = FALSE)
    # Label each group with the first string that appears in it
    df$to <- plyr::mapvalues(df$b, from = seq_along(unique(memb)), to = df$a[!duplicated(df$b)])

    prior <- data.frame(str = roles, to = df$to, stringsAsFactors = FALSE)
    prior
                    str          to
    1       Strawberies Strawberies
    2      strawberries Strawberies
    3 Mini strawberries Strawberies
    4           Avocado     Avocado
    5      Hass avocado     Avocado
    6       Not Avocado     Avocado
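
To keep the individual rows, as the question asks, the lookup table can be joined back onto the receipt data. The following is only a sketch in base R: the `prior` table is copied from the output printed above, while the `receipts` data frame and its column names (`receipt_id`, `description`, `item_group`) are made up for illustration.

```r
# Lookup table, copied from the printed output of the answer above.
prior <- data.frame(
  str = c("Strawberies", "strawberries", "Mini strawberries",
          "Avocado", "Hass avocado", "Not Avocado"),
  to  = c("Strawberies", "Strawberies", "Strawberies",
          "Avocado", "Avocado", "Avocado"),
  stringsAsFactors = FALSE
)

# Hypothetical receipt data: one row per purchased item (assumption, not from the question).
receipts <- data.frame(
  receipt_id  = c(1, 1, 2, 2, 3),
  description = c("strawberries", "Hass avocado", "Mini strawberries",
                  "Avocado", "Bananas"),
  stringsAsFactors = FALSE
)

# For each row, find the position of its description in the lookup (NA if absent).
idx <- match(receipts$description, prior$str)

# Add the group label as a new column; descriptions without a match keep their own text.
receipts$item_group <- ifelse(is.na(idx), receipts$description, prior$to[idx])
receipts
```

Every original row is kept and only a new column is added, so the raw descriptions stay available for checking doubtful groupings such as "Not Avocado".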