I have a list of addresses. These addresses were entered by various users, so there are a lot of differences in the way the same address is written. For example,
"andheri at weh pump house", "andheri pump house","andheri pump house(mt)","weh andheri pump house","weh andheri pump house et","weh, nr. pump house"
The above vector has 6 addresses, and almost all of them are the same. I am trying to find the matches between these addresses so that I can group them together and recode them.
I have tried using agrep and the stringdist package. With agrep, I am not sure whether I should use each address as a pattern and match it against the rest. With the stringdist package I did the following:
library(stringdist)
nsrpatt <- df$Address
x <- scan(what=character(), text = nsrpatt, sep=",")
x <- x[trimws(x)!= ""]
y <- ave(x, phonetic(x), FUN = function(.x) .x[1])
The above gives me the warning:
In phonetic(x) : soundex encountered 111 non-printable ASCII or non-ASCII
characters.
Not sure if I should remove those elements from the character vector or convert them to some other format.
With agrep I tried:
npat <- vector("list", length(nsrpatt))   # one result per address
for (i in seq_along(nsrpatt)) {
  npat[[i]] <- agrep(nsrpatt[i], df$Address, max.distance = 1, value = TRUE)
}
The length of the character vector is around 25000 and this keeps running and stalls the machine.
How do I efficiently find the closest match for each of the addresses?
You could run a small cluster analysis on your data.
x <- c("wall street", "Wall-street", "Wall ST", "andheri pump house",
       "weh, nr. pump house", "Wallstreet", "weh andheri pump house",
       "Wall Street", "weh andheri pump house et", "andheri at weh pump house",
       "andheri pump house(mt)")
First, you need a distance matrix.
# Levenshtein distance
e <- adist(na.omit(tolower(x)))
rownames(e) <- na.omit(x)
Then, a cluster analysis can be run.
hc <- hclust(as.dist(e)) # find distance clusters
Derive the best cutpoint, e.g. graphically, and "cut the tree".
plot(hc)
# cut the tree at a specific height, i.e. get a cluster code for each string
smly <- cutree(hc, h=16)
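If you already know roughly how many distinct addresses to expect, cutree() also accepts a number of clusters k instead of a height h. A minimal sketch on the same example vector; k = 2 is an assumption that only fits this toy data, and the name smly_k is just for illustration:

```r
x <- c("wall street", "Wall-street", "Wall ST", "andheri pump house",
       "weh, nr. pump house", "Wallstreet", "weh andheri pump house",
       "Wall Street", "weh andheri pump house et", "andheri at weh pump house",
       "andheri pump house(mt)")

e  <- adist(tolower(x))        # Levenshtein distance matrix
hc <- hclust(as.dist(e))       # hierarchical clustering (complete linkage by default)

# ask for a fixed number of clusters instead of a cut height
smly_k <- cutree(hc, k = 2)    # k = 2 is an assumption for this toy example
table(smly_k)                  # how many strings ended up in each cluster
```

With real data you would still want to look at the dendrogram first, since a badly chosen k merges unrelated addresses just as a badly chosen h does.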
Then you can build a key data frame, with which you can check whether the matches are right.
key <- data.frame(x=na.omit(x),
                  smly=factor(smly, labels=c("Wall Street", "Andheri Pump House")),
                  row.names=NULL) # key data frame
key
#                            x               smly
# 1                wall street        Wall Street
# 2                Wall-street        Wall Street
# 3                    Wall ST        Wall Street
# 4         andheri pump house Andheri Pump House
# 5        weh, nr. pump house Andheri Pump House
# 6                 Wallstreet        Wall Street
# 7     weh andheri pump house Andheri Pump House
# 8                Wall Street        Wall Street
# 9  weh andheri pump house et Andheri Pump House
# 10 andheri at weh pump house Andheri Pump House
# 11    andheri pump house(mt) Andheri Pump House
Finally replace your vector like so:
x <- key$smly
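Typing the labels by hand works for two clusters, but with thousands of addresses you would probably want to derive them. One possible sketch, assuming that the most frequent lower-cased spelling within a cluster (first in alphabetical order on ties) is an acceptable canonical form. pick_label is a hypothetical helper, not part of any package, and k = 2 is again an assumption for the toy data:

```r
x <- c("wall street", "Wall-street", "Wall ST", "andheri pump house",
       "weh, nr. pump house", "Wallstreet", "weh andheri pump house",
       "Wall Street", "weh andheri pump house et", "andheri at weh pump house",
       "andheri pump house(mt)")

# cluster codes as before (k = 2 is an assumption for this toy example)
smly <- cutree(hclust(as.dist(adist(tolower(x)))), k = 2)

# hypothetical helper: most frequent spelling in a cluster
pick_label <- function(v) names(which.max(table(v)))

# give every string the label chosen for its cluster
canonical <- ave(tolower(x), smly, FUN = pick_label)

key <- data.frame(x = x, canonical = canonical, stringsAsFactors = FALSE)
key
```

The labels come out lower-cased because the counting is done on tolower(x); you can still review the key data frame and overwrite individual labels before recoding.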