R: relevant matches between two huge data sets, even with spelling mistakes

I have this input:

"I am travelling on my own, I have just brought a world ticket to go to singapore, darwin, perth, adelaide, melbourne, brisbane, gold cost, sydney Opra, christchurch,gold coast Richland, Aukland,Austrlia, and fji. It is a 10 month journey. I will be going on my own, I am not scared but my friends and family seem to be against the idea. I have explained that it is safe and that I will probably meet people along the way and hostels are not as bad as theya re made out to be. for at least a 1/3 of my trip i will be staying with friends and family. I am excited, but people pesimistic views are making me doubt the safety. I am from the UK so will be a long way from home, and they are scared incase I get into trouble. I have never been to US"

I have a places list with as many as 5000 rows, for example: London, Singapore, Sydney, Aukland, Fiji, Gold Coast, Sydney Opera, Australia, UK, USA, ...

Problem: extract the places from the input by matching against the places list. The matching must tolerate spelling mistakes and return the closest match, and it needs to be optimized for speed.

Output: Singapore|Darwin|perth|adelaide|melbourne|brisbane|gold coast|sydney Opera|christchurch|Aukland|Austrlia|fiji|UK|USA

Tried Methods

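library(RecordLinkage)  # provides levenshteinSim()
library(cba)            # provides sdists()

# Lower-case the input and replace punctuation with spaces
input <- tolower(gsub("[[:punct:]]", " ", input))
Places <- read.delim("\\Data\\Places_List.csv", row.names = NULL, header = TRUE, sep = ",")
Places <- as.matrix(Places)

################## Different methods tried ##################

# Method 1: return the candidate(s) with the highest Levenshtein similarity
ClosestMatch2 <- function(string, stringVector) {
  similarity <- levenshteinSim(string, stringVector)
  stringVector[similarity == max(similarity)]
}
############### The above one doesn't work ###############

# Method 2: shortlist with agrep(), then keep the smallest weighted edit distance
ClosestMatch <- function(string, StringVector) {
  matches <- agrep(string, StringVector, value = TRUE)
  # The post had method = ""; "ow" (the sdists default) is assumed here
  distance <- as.numeric(sdists(string, matches, method = "ow", weight = c(1, 0, 2)))
  matches <- data.frame(matches, distance)
  matches <- subset(matches, distance == min(distance))
  as.character(matches$matches)
}
######## This works, but the results are not right ########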
These didn't work either: agrep and pmatch don't scale to bigger data sets, and str_detect doesn't handle spelling mistakes at all.


  • ClosestMatch2 works. In addition to that, add the difference in character count and a partial substring match to handle the spelling mistakes.
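
A rough sketch of that idea, using only base R. It slides a window of one to three words over the input, scores every window against every place with adist(), and drops pairs whose lengths differ too much. The function name match_places and the parameters max_words, min_sim and max_len_diff are made up for illustration, and the thresholds are guesses that would need tuning on the real 5000-row list.

```r
# Hypothetical helper (not from the original post): fuzzy-match every
# 1- to max_words-word window of the text against a vector of place names.
match_places <- function(text, places, max_words = 3,
                         min_sim = 0.75, max_len_diff = 2) {
  # Same normalisation as in the question: lower-case, punctuation to spaces
  clean <- tolower(gsub("[[:punct:]]", " ", text))
  words <- strsplit(trimws(clean), "\\s+")[[1]]
  words <- words[nzchar(words)]
  # All windows of 1..max_words consecutive words, de-duplicated
  grams <- unique(unlist(lapply(seq_len(max_words), function(n) {
    if (length(words) < n) return(character(0))
    vapply(seq_len(length(words) - n + 1),
           function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  })))
  if (length(grams) == 0) return(character(0))
  pl <- tolower(places)
  # Edit distance for all window/place pairs in one vectorised call
  d <- adist(grams, pl)
  # Scale by the longer string so the similarity lies in [0, 1]
  sim <- 1 - d / outer(nchar(grams), nchar(pl), pmax)
  # Character-count difference: ignore pairs whose lengths are too far apart
  sim[abs(outer(nchar(grams), nchar(pl), "-")) > max_len_diff] <- 0
  # Keep each place whose best-scoring window clears the threshold
  best <- apply(sim, 2, max)
  places[best >= min_sim]
}

places <- c("London", "Singapore", "Sydney Opera", "Gold Coast", "Auckland", "Fiji")
text <- "I am going to singapore, sydney Opra, gold cost, Aukland and fji."
match_places(text, places)
```

Because the score is relative to string length, very short names such as UK or USA only match exactly at this threshold, which avoids most false positives but will also miss a misspelled short name. The distance matrix has one row per window and one column per place, so for very long inputs it may be worth processing the text in chunks.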