I am looking for a way to find the closest match between two strings that may differ greatly in length. Let's say I have, on the one hand, a list of possible locations like:
Yosemite National Park
Yosemite Valley
Yosemite National Park Lodge
Yosemite National Park Visitor Center
San Francisco
Golden Gate Park San Francisco
Paris
New York
Manhattan New York
Hong Kong
On the other hand, I have multiple sentences like:
Now, say I would like to extract the location from this set of sentences. How would I proceed? I know about the Levenshtein distance algorithm, but I'm not sure it will work efficiently here, especially because I have many more locations and many more sentences to match. What I would love to have is a matching score of some sort for each location, so that I can pick the one with the highest score, but I have no idea how to compute this score.
Do you have any idea how to do that? Or perhaps even an implementation or a Python package?
Thanks in advance
For jobs like this, you'd typically use a processing pipeline along these general lines:
I should probably add that this sort of pipeline is more often used when you have a much larger number of documents, and each document individually is considerably larger as well. Since the "documents" and "queries" are represented exactly the same way, it's also useful/usable for cases where you want to categorize and group documents--that is, find how similar documents are to each other.
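To make the "same representation for documents and queries" idea concrete, here is a minimal sketch. It assumes a bag-of-words vector per text, compared by cosine similarity, which is one common way to build this kind of pipeline and not necessarily the only one. The stop-word list, the function names, and the example sentence are all made up for illustration, and stemming is left out to keep it short.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "the", "in", "at", "to", "of", "i", "we", "and", "was", "is"}


def tokenize(text):
    """Lowercase the text, keep runs of letters, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def vectorize(text):
    """Represent a text (location or sentence alike) as word counts."""
    return Counter(tokenize(text))


def cosine(u, v):
    """Cosine similarity between two count vectors; 0.0 if either is empty."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_locations(sentence, locations):
    """Return (score, location) pairs, highest score first."""
    query = vectorize(sentence)
    scored = [(cosine(query, vectorize(loc)), loc) for loc in locations]
    return sorted(scored, reverse=True)


locations = [
    "Yosemite National Park", "Yosemite Valley", "San Francisco",
    "Golden Gate Park San Francisco", "Paris", "New York", "Hong Kong",
]

# Hypothetical sentence, invented for this example.
sentence = "We spent the weekend hiking in Yosemite Valley"

for score, loc in rank_locations(sentence, locations):
    print(f"{score:.3f}  {loc}")
```

The highest-scoring entry is the candidate location. Because a location and a sentence go through the same `vectorize` step, the same `cosine` function can also compare two sentences or two locations with each other, which is what makes this usable for grouping similar documents.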