Search code examples
rdictionaryreplacemisspelling

R function to correct words by frequency of more proximate word


I have a table with misspelling words. I need to correct those using from the words more similar to that one, the one that have more frequency.

For example, after I run

aggregate(CustomerID ~ Province, ventas2, length)

I get

1                             
2                     AMBA         29
    3                   BAIRES          1
    4              BENOS AIRES          1

    12            BUENAS AIRES          1

    17           BUENOS  AIRES          4
    18            buenos aires          7
    19            Buenos Aires          3
    20            BUENOS AIRES      11337
    35                 CORDOBA       2297
    36                cordoba           1
    38               CORDOBESA          1
    39              CORRIENTES        424

So I need to replace buenos aires, Buenos Aires, Baires, BUENOS AIRES, with BUENOS AIRES but AMBA shouldn't be replaced. Also CORDOBESA and cordoba should be replaced by CORDOBA, but not CORRIENTES.

How can I do this in R?

Thanks!


Solution

  • Here's a possibile solution.

    Disclaimer :
    This code seems to works fine with your current example. I don't assure that the current parameters (e.g. cut height, cluster agglomeration method, distance method etc.) will be valid for your real (complete) data.

    # recreating your data
    data <- 
    read.csv(text=
    'City,Occurr
    AMBA,29
    BAIRES,1
    BENOS AIRES,1
    BUENAS AIRES,1
    BUENOS  AIRES,4
    buenos aires,7
    Buenos Aires,3
    BUENOS AIRES,11337
    CORDOBA,2297
    cordoba,1
    CORDOBESA,1
    CORRIENTES,424',stringsAsFactors=F)
    
    
    # simple pre-processing to city strings:
    # - removing spaces
    # - turning strings to uppercase
    cities <- gsub('\\s+','',toupper(data$City))
    
    # string distance computation
    # N.B. here you can play with single components of distance costs 
    d <- adist(cities, costs=list(insertions=1, deletions=1, substitutions=1))
    # assign original cities names to distance matrix
    rownames(d) <- data$City
    # clustering cities
    hc <- hclust(as.dist(d),method='single')
    
    # plot the cluster dendrogram
    plot(hc)
    # add the cluster rectangles (just to see the clusters) 
    # N.B. I decided to cut at distance height < 5
    #      (read it as: "I consider equal 2 strings needing
    #       less than 5 modifications to pass from one to the other")
    #      Obviously you can use another value.
    rect.hclust(hc,h=4.9)
    
    # get the clusters ids
    clusters <- cutree(hc,h=4.9) 
    # turn into data.frame
    clusters <- data.frame(City=names(clusters),ClusterId=clusters)
    
    # merge with frequencies
    merged <- merge(data,clusters,all.x=T,by='City') 
    
    # add CityCorrected column to the merged data.frame
    ret <- by(merged, 
              merged$ClusterId,
              FUN=function(grp){
                    idx <- which.max(grp$Occur)
                    grp$CityCorrected <- grp[idx,'City']
                    return(grp)
                  })
    
    fixed <- do.call(rbind,ret)
    

    Result :

    > fixed
                  City Occurr ClusterId CityCorrected
    1             AMBA     29         1          AMBA
    2.2         BAIRES      1         2  BUENOS AIRES
    2.3    BENOS AIRES      1         2  BUENOS AIRES
    2.4   BUENAS AIRES      1         2  BUENOS AIRES
    2.5  BUENOS  AIRES      4         2  BUENOS AIRES
    2.6   buenos aires      7         2  BUENOS AIRES
    2.7   Buenos Aires      3         2  BUENOS AIRES
    2.8   BUENOS AIRES  11337         2  BUENOS AIRES
    3.9        cordoba      1         3       CORDOBA
    3.10       CORDOBA   2297         3       CORDOBA
    3.11     CORDOBESA      1         3       CORDOBA
    4       CORRIENTES    424         4    CORRIENTES
    

    Cluster Plot :

    enter image description here