python scikit-learn nlp gensim word2vec

Find most similar sentence in a large dataset of sentences


I currently have a text file with around a million sentences, each on a new line. I am trying to build a solution where I can take a new sentence outside of this text file and have the program return the most similar sentence present in the file.

I have found some solutions that return the pair of sentences with the highest similarity INSIDE the existing dataset, for example this one. But that is not what I am going for. I want to be able to compare a new sentence with all of those in the text file.

Also, I am not sure if I should be focusing on semantic similarity or cosine similarity.


Solution

  • I advise you to read about Damerau–Levenshtein distance. I was also looking for a similar solution and settled on this algorithm.
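
To make the idea concrete, here is a minimal pure-Python sketch, not any particular library's implementation. It uses the restricted variant of Damerau–Levenshtein (often called optimal string alignment), which counts insertions, deletions, substitutions and swaps of adjacent characters. The names `osa_distance` and `most_similar` are made up for this illustration. Note that this measures character-level edits, not meaning.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = distance between the first i chars of a and the first j chars of b
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # swap of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def most_similar(query, sentences):
    """Return the sentence with the smallest edit distance to the query."""
    return min(sentences, key=lambda s: osa_distance(query, s))


# Hypothetical stand-in for the lines of the text file.
corpus = [
    "the cat sat on the mat",
    "dogs bark loudly",
    "a completely different line",
]
best = most_similar("the cat sat on teh mat", corpus)
```

Because `most_similar` accepts any iterable, the lines of the file can be passed in lazily instead of a list. Each comparison costs time proportional to the product of the two sentence lengths, so with around a million sentences a plain Python loop like this will probably be slow. One of the compiled implementations is likely the better choice in practice.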

    There are implementations for Python: