Tags: algorithm, language-agnostic, string-comparison, string-matching, levenshtein-distance

Getting the closest string match (with possibly very different string sizes)


I am looking for a way to find the closest match between two strings that may have very different sizes. Let's say I have, on the one hand, a list of possible locations like:

Yosemite National Park

Yosemite Valley

Yosemite National Park Lodge

Yosemite National Park Visitor Center

San Francisco

Golden Gate Park San Francisco

Paris

New York

Manhattan New York

Hong Kong

On the other hand, I have multiple sentences like:

  1. "I proposed to my wife on the 12th of November 1984, during a crazy downpour in the middle of Yosemite in California"
  2. "I love to walk my dog in Central Park, New York"
  3. "I love Hong Kong"

Now, say I would like to extract the location from this set of sentences: how would I proceed? I know about the Levenshtein distance algorithm, but I'm not sure it will work efficiently here, especially because I have many more locations and many more sentences to try and match. I guess what I would love to have is a matching score of some sort for each location, so that I can pick the one with the highest score, but I have no idea how to compute this score.

Do you have any idea of how to do that, or perhaps even an implementation or Python package?

Thanks in advance


Solution

  • For jobs like this, you'd typically use a processing pipeline along these general lines:

    1. Remove "noise" words (aka "stop words") like "a", "an", "the", "is", and so on. If you look around a bit, you can find various lists of stop words to filter out.
    2. Create a vector space model of each "document" in your corpus.
    3. Create a vector space model of the query.
    4. Compute the similarity (e.g., the cosine similarity over TF-IDF weights) between the query vector and each candidate document vector.
    5. Choose the highest score as representing the most likely match.
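    The steps above can be sketched in pure Python using only the standard library. The stop-word list and tokenizer here are simplified placeholders (a real system would use a fuller list, such as NLTK's), and the query is vectorized alongside the corpus so it shares the same document-frequency statistics:

    ```python
    import math
    import re
    from collections import Counter

    # Toy stop-word list -- a real one (e.g. NLTK's) is much longer (step 1).
    STOP_WORDS = {"a", "an", "the", "is", "of", "on", "in", "to", "my", "i"}

    def tokenize(text):
        """Lowercase, keep alphabetic runs, and drop stop words."""
        return [w for w in re.findall(r"[a-z]+", text.lower())
                if w not in STOP_WORDS]

    def tf_idf_vectors(documents):
        """One sparse TF-IDF vector (dict: term -> weight) per document (steps 2-3)."""
        tokenized = [tokenize(d) for d in documents]
        n = len(tokenized)
        df = Counter(t for doc in tokenized for t in set(doc))
        return [{t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
                for doc in tokenized]

    def cosine(u, v):
        """Cosine similarity between two sparse vectors (step 4)."""
        dot = sum(w * v[t] for t, w in u.items() if t in v)
        norm = (math.sqrt(sum(w * w for w in u.values()))
                * math.sqrt(sum(w * w for w in v.values())))
        return dot / norm if norm else 0.0

    def best_match(query, documents):
        """Score the query against every document; return the top one (step 5)."""
        vectors = tf_idf_vectors(documents + [query])
        query_vec, doc_vecs = vectors[-1], vectors[:-1]
        return max(zip(documents, doc_vecs),
                   key=lambda pair: cosine(query_vec, pair[1]))[0]
    ```

    With the location list from the question, `best_match("I love Hong Kong", locations)` returns `"Hong Kong"`, since only that document shares any weighted terms with the query.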

    References

    1. Onix stop word list
    2. NLTK stop word list
    3. vector space model
    4. cosine similarity
    5. tf-idf
    6. tf-idf vs. cosine similarity

    I should probably add that this sort of pipeline is more often used when you have a much larger number of documents, and each document individually is considerably larger as well. Since the "documents" and "queries" are represented in exactly the same way, it's also useful for cases where you want to categorize and group documents--that is, find how similar documents are to each other.
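    To illustrate that last point: because documents share one representation, the same cosine measure compares any two documents directly. Here's a bare-bones version over raw term counts (no TF-IDF weighting or stop-word filtering, for brevity):

    ```python
    import math
    from collections import Counter

    def count_cosine(a, b):
        """Cosine similarity over raw term-count vectors of two strings."""
        u, v = Counter(a.lower().split()), Counter(b.lower().split())
        dot = sum(c * v[t] for t, c in u.items())  # Counter yields 0 for absent terms
        return dot / (math.sqrt(sum(c * c for c in u.values()))
                      * math.sqrt(sum(c * c for c in v.values())))

    # Documents about the same place score higher with each other:
    # count_cosine("Yosemite National Park Lodge", "Yosemite National Park")
    # exceeds
    # count_cosine("Yosemite National Park Lodge", "San Francisco")
    ```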