tensorflow, nlp, similarity, spacy, sentence-similarity

Similarity between two lists of documents


I need to compute the similarity between two lists of short texts in Python. Each text is 1-4 words long, and each list can contain up to 10K items, so I need to compute 10K × 10K = 100M similarity scores efficiently. I couldn't find an efficient way to do this in spaCy. Can other packages do it? I assume each word is represented by a 300-dimensional vector, but other representations are fine too. The scores could be computed in a loop, but there must be a more efficient way. The task seems well suited to TensorFlow, PyTorch, and similar packages, but I'm not familiar with their details.
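For reference, the brute-force version of this computation can be sketched with plain NumPy, assuming each text has already been turned into one 300-d vector (for example by averaging its word vectors). The random arrays, the function name `cosine_similarity_blocks`, and the `block_size` value are illustrative assumptions, not part of the question. Normalising the rows once turns cosine similarity into a matrix product, and working in row blocks keeps memory bounded (a full 10K × 10K float32 matrix would take about 400 MB).

```python
import numpy as np

def cosine_similarity_blocks(a, b, block_size=1000):
    """Yield (start_row, block) pairs, where block holds the cosine
    similarities between a[start_row:start_row + block_size] and all rows of b."""
    eps = 1e-12  # guards against division by zero for all-zero vectors
    a_unit = a / np.maximum(np.linalg.norm(a, axis=1, keepdims=True), eps)
    b_unit = b / np.maximum(np.linalg.norm(b, axis=1, keepdims=True), eps)
    for start in range(0, a_unit.shape[0], block_size):
        # One matrix product scores a whole block of rows against all of b.
        yield start, a_unit[start:start + block_size] @ b_unit.T

# Stand-in data: random vectors instead of real text embeddings (assumption).
rng = np.random.default_rng(0)
vectors_a = rng.normal(size=(2000, 300)).astype(np.float32)
vectors_b = rng.normal(size=(1500, 300)).astype(np.float32)

# Keep only the best match per row of vectors_a instead of the full matrix.
best_match = np.empty(len(vectors_a), dtype=np.int64)
best_score = np.empty(len(vectors_a), dtype=np.float32)
for start, block in cosine_similarity_blocks(vectors_a, vectors_b):
    stop = start + block.shape[0]
    best_match[start:stop] = block.argmax(axis=1)
    best_score[start:stop] = block.max(axis=1)
```

This still evaluates every pair, so the cost grows with the product of the two list lengths; it only replaces the Python-level loop with vectorised operations.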


Solution

  • The solution was to use something like Spotify's Annoy, which performs approximate nearest-neighbour search. It retrieves the closest matches for each text without scoring every pair. Other libraries for nearest-neighbour search exist as well.