Find most similar sentence in a large dataset of sentences

I currently have a text file with around a million sentences, each on a new line. I am trying to build a solution where I can take a new sentence outside of this text file and have the program return the most similar sentence present in the file.

I have found some solutions that return the pair of sentences with the highest similarity INSIDE the existing dataset, for example this one. But that is not what I am going for. I want to compare a new sentence against every sentence in the text file.

Also, I am not sure if I should be focusing on semantic similarity or cosine similarity.

Solution

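I suggest looking at the Damerau–Levenshtein distance. It is an edit distance: it counts the insertions, deletions, substitutions, and transpositions of adjacent characters needed to turn one string into another, so it measures character-level similarity rather than semantic similarity. I was looking for a similar solution and settled on this algorithm.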

There are implementations for Python:

fastDamerauLevenshtein
pyxDamerauLevenshtein
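
To show how this could be applied to the question, here is a minimal pure-Python sketch. It does not use either library above. Instead it implements the restricted variant of the distance (often called optimal string alignment) by hand, so the function names `osa_distance` and `most_similar` are my own and purely illustrative. It assumes the sentences are available as an iterable of strings, for example an open file with one sentence per line.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein).

    Counts insertions, deletions, substitutions, and transpositions of
    two adjacent characters.
    """
    prev2 = None                       # row i - 2 of the DP table
    prev = list(range(len(b) + 1))     # row i - 1 (starts as row 0)
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(
                prev[j] + 1,           # deletion
                cur[j - 1] + 1,        # insertion
                prev[j - 1] + cost,    # substitution or match
            )
            # Transposition of two adjacent characters.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                cur[j] = min(cur[j], prev2[j - 2] + 1)
        prev2, prev = prev, cur
    return prev[len(b)]


def most_similar(query, sentences):
    """Return (sentence, distance) for the sentence closest to `query`.

    `sentences` is any iterable of strings; trailing newlines are stripped,
    so an open text file can be passed directly.
    """
    best_sentence, best_distance = None, None
    for sentence in sentences:
        sentence = sentence.rstrip("\n")
        distance = osa_distance(query, sentence)
        if best_distance is None or distance < best_distance:
            best_sentence, best_distance = sentence, distance
    return best_sentence, best_distance


if __name__ == "__main__":
    corpus = [
        "the cat sat on the mat",
        "a dog barked loudly",
        "hello world",
    ]
    print(most_similar("the cat sat on teh mat", corpus))
```

This is a linear scan, so every query compares against all sentences. With around a million lines a pure-Python loop like this will likely be slow, which is one reason to swap `osa_distance` for one of the compiled implementations listed above.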