I am working on entity extraction in R. I have a UniqueID field and a Text field, and I need to extract location information from the Text field.
My Text field contains descriptions with location names:
text <- c("SERANGOON JC","Blk 4","SHELL TAMPINES AVE 4","SENOKO INDUSTRIAL ESTATE","Senoko Estate","Senoko","senok Est.")
I also have a list of locations:
Loc <- c("SERANGOON JUNIOR COLLEGE","Block 4","SHELL TAMPINES AVENUE 4","SENOKO INDUSTRIAL ESTATE")
I need to match the text against Loc and extract those locations from the Text field. In the Text field, SENOKO INDUSTRIAL ESTATE is spelt in different ways: as Senoko Estate or Senoko (partial names), or with a spelling mistake, as in senok Est. For all of these misspelt and partially spelt entries, I need to get the exact name from Loc, i.e. SENOKO INDUSTRIAL ESTATE.
My output would look like this (locations extracted from the Text field, with the correct name substituted for partially spelt and misspelt entries):
ID Location
123 SERANGOON JUNIOR COLLEGE|Block 4|SHELL TAMPINES AVENUE 4|SENOKO INDUSTRIAL ESTATE|SENOKO INDUSTRIAL ESTATE|SENOKO INDUSTRIAL ESTATE|SENOKO INDUSTRIAL ESTATE
I don't think this is the prettiest way to answer it, but...
# Input data from the question
text <- c("SERANGOON JC","Blk 4","SHELL TAMPINES AVE 4","SENOKO INDUSTRIAL ESTATE","Senoko Estate","Senoko","senok Est.")
Loc <- c("SERANGOON JUNIOR COLLEGE","Block 4","SHELL TAMPINES AVENUE 4","SENOKO INDUSTRIAL ESTATE")
# Replace every entry containing a distinctive fragment with its full name from Loc
text <- gsub(".*serang.*", "SERANGOON JUNIOR COLLEGE", text, ignore.case=TRUE)
text <- gsub(".*bl.* 4.*", "Block 4", text, ignore.case=TRUE)
text <- gsub(".*shell.*", "SHELL TAMPINES AVENUE 4", text, ignore.case=TRUE)
text <- gsub(".*senok.*", "SENOKO INDUSTRIAL ESTATE", text, ignore.case=TRUE)
print(text)
I didn't put it in exactly the format you requested, but this gives the contents of the second column (Location). I put ".*" before and after each search string so that the whole entry is replaced by the full name, whatever other characters or typos surround the matched fragment. That makes it more tolerant of variations.
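If you do want the single pipe-delimited row from the question, something along these lines might work. It is only a sketch: the `rules` lookup vector and the `cleaned` and `result` names are my own illustration, the patterns are the same hand-written ones as above, and the ID value 123 is simply copied from your example output.

```r
text <- c("SERANGOON JC", "Blk 4", "SHELL TAMPINES AVE 4",
          "SENOKO INDUSTRIAL ESTATE", "Senoko Estate", "Senoko", "senok Est.")

# Hypothetical lookup: regex pattern -> full location name.
# The patterns are hand-written for this sample data, not derived from Loc.
rules <- c(".*serang.*" = "SERANGOON JUNIOR COLLEGE",
           ".*bl.* 4.*" = "Block 4",
           ".*shell.*"  = "SHELL TAMPINES AVENUE 4",
           ".*senok.*"  = "SENOKO INDUSTRIAL ESTATE")

# Apply each rule in turn; entries matching no rule are left unchanged.
cleaned <- text
for (p in names(rules)) {
  cleaned <- gsub(p, rules[[p]], cleaned, ignore.case = TRUE)
}

# Collapse all locations into one "|"-separated string for the given ID.
result <- data.frame(ID = 123,
                     Location = paste(cleaned, collapse = "|"),
                     stringsAsFactors = FALSE)
print(result)
```

Keeping the patterns in one named vector means adding a new location is a one-line change, but each pattern still has to be written by hand, so this will not catch misspellings you have not anticipated.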
Hope this helps!