Tags: regex, algorithm, similarity, levenshtein-distance, hamming-distance

Algorithm to match sentences about the same topic


I've been researching different algorithms, but haven't found exactly what I'm looking for.

So far I've looked at:

  • Hamming distance (only works for strings of the same length)
  • Levenshtein distance (finds similar words like kitten and sitten)
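For reference, here is a minimal sketch of the two distances mentioned above, written in plain Python. The function names `hamming` and `levenshtein` are just illustrative choices, not from any particular library. Both measures work on characters, which is why they catch near-identical spellings but say little about whether two sentences share a topic.

```python
def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        prev = curr
    return prev[-1]


print(hamming("kitten", "sitten"))       # one differing position
print(levenshtein("kitten", "sitting"))  # classic textbook pair
```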

What I'm looking for is something that would find sentences about the same idea.

For example:

Sentence 1: Josh got hurt while playing in the park.
Sentence 2: Josh fell off the slide and got hurt at the park.
Sentence 3: Be careful at the park, your kids could get hurt.
Sentence 4: Josh likes to go shopping.

What I'm looking for would consider sentences 1 and 2 to be on topic, but not sentence 3 or 4.

I guess I could try to compare each word in the sentence?
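A rough sketch of that word-by-word idea, assuming a Jaccard overlap (shared words divided by all distinct words) on lowercased tokens. The tiny stop-word list and the helper names `content_words` and `jaccard` are made up for illustration; a real solution would likely need stemming, synonyms, and a tuned threshold.

```python
import re

# Hypothetical, hand-picked stop words; a real list would be much longer.
STOP_WORDS = {"a", "an", "the", "in", "at", "and", "to", "off",
              "while", "your", "be", "could", "of", "on"}


def content_words(sentence: str) -> set[str]:
    """Lowercase the sentence, split it into words, and drop stop words."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return {t for t in tokens if t not in STOP_WORDS}


def jaccard(s1: str, s2: str) -> float:
    """Share of distinct content words that both sentences have in common."""
    w1, w2 = content_words(s1), content_words(s2)
    if not w1 and not w2:
        return 0.0
    return len(w1 & w2) / len(w1 | w2)


SENTENCES = [
    "Josh got hurt while playing in the park.",
    "Josh fell off the slide and got hurt at the park.",
    "Be careful at the park, your kids could get hurt.",
    "Josh likes to go shopping.",
]

# Compare sentence 1 against the others.
for other in SENTENCES[1:]:
    print(f"{jaccard(SENTENCES[0], other):.2f}  {other}")
```

With these assumptions, sentence 2 scores noticeably higher against sentence 1 than sentences 3 and 4 do, but the approach is fragile: "got" and "get" count as different words, and sentences that share a name but not a topic still overlap a little.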

I would greatly appreciate anyone who could point me in the right direction.


Solution

  • In general, you would need some natural language processing (NLP). If you are new to the subject, I recommend taking a look at NLTK, a Python library that includes tools for a variety of NLP problems. The project also publishes a free book that gives a quick overview of the tools you may need:

    www.nltk.org/book/

    I hope this helps.