I have the following input:
"I am travelling on my own, I have just brought a world ticket to go to singapore, darwin, perth, adelaide, melbourne, brisbane, gold cost, sydney Opra, christchurch,gold coast Richland, Aukland,Austrlia, and fji. It is a 10 month journey. I will be going on my own, I am not scared but my friends and family seem to be against the idea. I have explained that it is safe and that I will probably meet people along the way and hostels are not as bad as theya re made out to be. for at least a 1/3 of my trip i will be staying with friends and family. I am excited, but people pesimistic views are making me doubt the safety. I am from the UK so will be a long way from home, and they are scared incase I get into trouble. I have never been to US"
I also have a places list of about 5,000 rows, for example: London, Singapore, Sydney, Aukland, Fiji, Gold Coast, Sydney Opera, Australia, UK, USA, ...
Problem: Extract the places from the input by matching it against the places list. The matching has to tolerate spelling mistakes and return the closest match, and it needs to be optimized for a list of this size.
Expected output: Singapore|Darwin|perth|adelaide|melbourne|brisbane|gold coast|sydney Opera|christchurch|Aukland|Austrlia|fiji|UK|USA
Methods tried:
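library(RecordLinkage)  # provides levenshteinSim()
library(stringdist)
library(cba)            # provides sdists(), used in ClosestMatch() below
# Lower-case the input and replace punctuation with spaces
input <- tolower(gsub("[[:punct:]]", " ", input))
Places <- read.delim("\\Data\\Places_List.csv", row.names = NULL, header = TRUE, sep = ",")
Places <- as.matrix(Places)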
################## Different methods tried ##################
# Return the entries of stringVector with the highest Levenshtein similarity to string
ClosestMatch2 <- function(string, stringVector) {
  distance <- levenshteinSim(string, stringVector)
  stringVector[distance == max(distance)]
}
ClosestMatch2(input, Places)
############### The above one does not work ###############
# Pre-filter with agrep(), then keep the candidates with the smallest sequence distance
ClosestMatch <- function(string, StringVector) {
  matches <- agrep(string, StringVector, value = TRUE)
  distance <- sdists(string, matches, method = "", weight = c(1, 0, 2))
  matches <- data.frame(matches, as.numeric(distance))
  matches <- subset(matches, distance == min(distance))
  as.character(matches$matches)
}
ClosestMatch(input, Places)
######## This works, but does not give proper results ########
k <- as.matrix(sapply(input, agrep, Places))
###### This did not work either ######
agrep, pmatch and str_detect (which will not work for spelling mistakes) do not work for bigger data sets.
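One direction that might scale better is to compare short word n-grams taken from the input, rather than the whole paragraph, against the places list. The following is only a minimal sketch under assumptions: `word_ngrams` and `extract_places` are hypothetical helper names, the small `places` vector stands in for the real list, and `max_dist = 1` is an illustrative threshold that would need tuning (very short names such as "UK" are easy to over-match with larger values). It uses `amatch()` from the stringdist package, which looks up the closest table entry within a maximum distance.

```r
library(stringdist)

# Hypothetical helper: every run of 1..max_words consecutive words in `text`.
word_ngrams <- function(text, max_words = 2) {
  words <- strsplit(tolower(gsub("[[:punct:]]", " ", text)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  grams <- character(0)
  for (n in seq_len(max_words)) {
    if (length(words) < n) break
    starts <- seq_len(length(words) - n + 1)
    grams <- c(grams, vapply(starts, function(i) {
      paste(words[i:(i + n - 1)], collapse = " ")
    }, character(1)))
  }
  unique(grams)
}

# Hypothetical helper: places that lie within `max_dist` edits of some n-gram.
extract_places <- function(text, places, max_dist = 1, max_words = 2) {
  grams <- word_ngrams(text, max_words)
  # amatch() returns, per n-gram, the index of the closest place or NA
  idx <- amatch(grams, tolower(places), method = "osa", maxDist = max_dist)
  unique(places[idx[!is.na(idx)]])
}

# Illustrative data only (assumption: the real list has about 5,000 rows)
places <- c("London", "Singapore", "Sydney Opera", "Gold Coast", "Fiji", "UK", "USA")
text <- "I am going to singapore, gold cost, sydney Opra and fji. I am from the UK."
found <- extract_places(text, places)
print(paste(found, collapse = "|"))
```

Because the input is only split once and each n-gram is looked up in a single vectorized call, the cost grows with the number of n-grams rather than with repeated scans of the full paragraph.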
ClosestMatch2 works. In addition to it, I would like to add the difference in the number of characters and a partial substring match, so that places with spelling mistakes are matched as well.
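To make that idea concrete, here is a rough sketch of one possible combined score: Levenshtein similarity, minus a penalty for the relative difference in character counts, plus a bonus when one string contains the other. The function names `match_score` and `best_match`, the weights `w_len` and `w_sub`, and the `min_score` cut-off are all assumptions chosen for illustration, not tuned values. It uses only base R (`adist` and `grepl`).

```r
# Hypothetical scoring function; weights are illustrative assumptions.
match_score <- function(candidate, places, w_len = 0.1, w_sub = 0.2) {
  cand <- tolower(candidate)
  plc <- tolower(places)
  longest <- pmax(nchar(cand), nchar(plc))
  # Levenshtein similarity in [0, 1]
  sim <- 1 - as.vector(adist(cand, plc)) / longest
  # Relative difference in the number of characters
  len_gap <- abs(nchar(cand) - nchar(plc)) / longest
  # TRUE when either string is a substring of the other
  is_sub <- vapply(plc, function(p) {
    grepl(p, cand, fixed = TRUE) || grepl(cand, p, fixed = TRUE)
  }, logical(1), USE.NAMES = FALSE)
  sim - w_len * len_gap + w_sub * is_sub
}

# Hypothetical wrapper: best-scoring place, or NA when nothing is close enough.
best_match <- function(candidate, places, min_score = 0.7) {
  scores <- match_score(candidate, places)
  best <- which.max(scores)
  if (scores[best] >= min_score) places[best] else NA_character_
}

# Illustrative data only
places <- c("Sydney", "Sydney Opera", "Fiji", "Gold Coast")
print(best_match("sydney opra", places))
print(best_match("fji", places))
print(best_match("friends", places))
```

The cut-off matters as much as the score itself: without it, every ordinary word in the input would be mapped to whichever place happens to be least dissimilar.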