Tags: r, pattern-matching, stringdist, agrep

R Finding elements matching with each other within a vector


I have a list of addresses. They were entered by various users, so the same address is often written in many different ways. For example:

"andheri at weh pump house", "andheri pump house","andheri pump house(mt)","weh andheri pump house","weh andheri pump house et","weh, nr. pump house" 

The above vector has 6 addresses, and almost all of them are the same. I am trying to find the matches between these addresses so that I can group them together and recode them.

I have tried using agrep and the stringdist package. With agrep, I am not sure whether I should use each address as a pattern and match it against the rest. With the stringdist package I did the following:

library(stringdist)
nsrpatt <- df$Address
x <- scan(what=character(), text = nsrpatt, sep=",")
x <- x[trimws(x)!= ""]
y <- ave(x, phonetic(x), FUN = function(.x) .x[1])

The above gives me the warning:

In phonetic(x) : soundex encountered 111 non-printable ASCII or non-ASCII
  characters. 

Not sure if I should remove those elements from the character vector or convert them to some other format.

With agrep I tried:

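npat <- vector("list", length(nsrpatt))
for (i in seq_along(nsrpatt)) {
  # fuzzy-match address i against every address and keep the matching strings
  npat[[i]] <- agrep(nsrpatt[i], df$Address, max.distance = 1, value = TRUE)
}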

The length of the character vector is around 25000 and this keeps running and stalls the machine.

How do I efficiently find the closest match for each address?


Solution

  • You could run a small cluster analysis on your data.

    x <- c("wall street", "Wall-street", "Wall ST", "andheri pump house", 
           "weh, nr. pump house", "Wallstreet", "weh andheri pump house", 
           "Wall Street", "weh andheri pump house et", "andheri at weh pump house", 
           "andheri pump house(mt)")
    

    First, you need a distance matrix.

    # Levenshtein distance
    e  <- adist(na.omit(tolower(x)))
    rownames(e) <- na.omit(x)
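
    With around 25,000 addresses, as in the question, a full distance matrix has 25,000 × 25,000 cells, so it may help to shrink the input first. The following is only a sketch, assuming that case, punctuation and repeated spaces carry no meaning in your addresses; `normalize_address` is a hypothetical helper, not part of the approach above. As a side effect it also drops non-ASCII characters such as the ones soundex complained about.

    ```r
    # Hypothetical helper: lower-case the strings, then turn every run of
    # characters other than a-z or 0-9 (punctuation, non-ASCII, spaces)
    # into a single space and trim the ends.
    normalize_address <- function(s) {
      s <- tolower(s)
      s <- gsub("[^a-z0-9]+", " ", s, perl = TRUE)
      trimws(s)
    }

    raw  <- c("Wall Street", "wall-street", "Wall  street ", "andheri pump house(mt)")
    norm <- normalize_address(raw)
    uniq <- unique(norm)            # only the distinct spellings need clustering

    e_small <- adist(uniq)          # smaller than adist(raw) when duplicates are common
    dimnames(e_small) <- list(uniq, uniq)

    # match() maps every original address back to its row in the reduced matrix
    back <- match(norm, uniq)
    ```

    Cluster codes computed on `uniq` can then be carried back to the full vector by indexing them with `back`.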
    

    Then, a cluster analysis can be run.

    hc <- hclust(as.dist(e))  # find distance clusters
    

    Derive the best cutpoint, e.g. graphically, and "cut the tree".

    plot(hc)
    

    [Image: dendrogram produced by plot(hc)]
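
    If reading the cutpoint off the plot is not practical, one possible heuristic (an assumption on my part, not something the dendrogram guarantees) is to cut in the middle of the widest gap between successive merge heights. A minimal sketch on the example data:

    ```r
    x <- c("wall street", "Wall-street", "Wall ST", "andheri pump house",
           "weh, nr. pump house", "Wallstreet", "weh andheri pump house",
           "Wall Street", "weh andheri pump house et", "andheri at weh pump house",
           "andheri pump house(mt)")
    hc <- hclust(as.dist(adist(tolower(x))))

    h      <- hc$height               # merge heights; non-decreasing for complete linkage
    i      <- which.max(diff(h))      # position of the widest gap between two merges
    cut_at <- mean(h[c(i, i + 1)])    # cut halfway across that gap

    n_clusters <- length(unique(cutree(hc, h = cut_at)))
    ```

    Treat `cut_at` as a starting point and still check the resulting groups by eye.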

    # cut the tree at a given height (h); each resulting cluster gets one code
    smly <- cutree(hc, h=16)
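
    `cutree()` also accepts `k`, the number of clusters you want, instead of a height. To eyeball what ended up together, the members can be listed per cluster with `split()`. A small sketch on the example data, assuming you expect two groups:

    ```r
    x <- c("wall street", "Wall-street", "Wall ST", "andheri pump house",
           "weh, nr. pump house", "Wallstreet", "weh andheri pump house",
           "Wall Street", "weh andheri pump house et", "andheri at weh pump house",
           "andheri pump house(mt)")
    hc <- hclust(as.dist(adist(tolower(x))))

    smly_k <- cutree(hc, k = 2)    # ask for two clusters rather than a cut height
    groups <- split(x, smly_k)     # one character vector of addresses per cluster
    groups
    ```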
    

    Then you can build a key data frame, with which you can check whether the matches are right.

    key <- data.frame(x=na.omit(x), 
                      smly=factor(smly, labels=c("Wall Street", "Andheri Pump House")),
                      row.names=NULL)  # key data frame
    key
    #                            x               smly
    # 1                wall street        Wall Street
    # 2                Wall-street        Wall Street
    # 3                    Wall ST        Wall Street
    # 4         andheri pump house Andheri Pump House
    # 5        weh, nr. pump house Andheri Pump House
    # 6                 Wallstreet        Wall Street
    # 7     weh andheri pump house Andheri Pump House
    # 8                Wall Street        Wall Street
    # 9  weh andheri pump house et Andheri Pump House
    # 10 andheri at weh pump house Andheri Pump House
    # 11    andheri pump house(mt) Andheri Pump House
    

    Finally replace your vector like so:

    x <- key$smly
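
    Typing the labels by hand works for two clusters but not for hundreds. One way around that, offered as a sketch rather than as part of the method above, is to label each cluster with its medoid, i.e. the member with the smallest total distance to the other members. It assumes `k = 2` only because the example has two groups.

    ```r
    x <- c("wall street", "Wall-street", "Wall ST", "andheri pump house",
           "weh, nr. pump house", "Wallstreet", "weh andheri pump house",
           "Wall Street", "weh andheri pump house et", "andheri at weh pump house",
           "andheri pump house(mt)")
    e    <- adist(tolower(x))
    smly <- cutree(hclust(as.dist(e)), k = 2)

    # For each cluster, find the index of the member whose summed distance
    # to the other members of the same cluster is smallest (the medoid).
    medoid <- vapply(split(seq_along(x), smly), function(idx) {
      idx[which.min(rowSums(e[idx, idx, drop = FALSE]))]
    }, integer(1))

    # Replace every address by the medoid string of its cluster
    recoded <- x[medoid[as.character(smly)]]
    ```

    The medoid is always one of the spellings that actually occurs in the data, so it may still need tidying (capitalisation, abbreviations) before it is used as the final code.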