Tags: python, nlp, vectorization, similarity, spacy

How to apply spaCy's similarity function to an array of strings to get an array of scores?


I need to compare one spaCy document against a list of spaCy documents and get a list of similarity scores as output. I can do this with a for loop, but I'm looking for an optimized solution, something like the broadcasting NumPy offers.

I have one document and a list of documents to compare it with:

oneDoc = 'Hello, I want to be compared with a list of documents'
listDocs = ["I'm the first one", "I'm the second one"]

spaCy offers us a document similarity function:

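import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")  # assumed model; any pipeline with word vectors works
oneDoc = nlp(oneDoc)
listDocs = list(nlp.pipe(listDocs))  # nlp() takes a single string, so use nlp.pipe for a list
similarity_score = np.zeros(len(listDocs))
for i, doc in enumerate(listDocs):
    similarity_score[i] = oneDoc.similarity(doc)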

Since one document is compared with a list of two documents, the result would be an array of two scores, for example [0.7, 0.8].

I'm looking for a way to avoid this for loop. In other words, I want to vectorize this function.


Solution

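  • Use nlp.pipe to process all of your text documents, then take the embedding (.vector) from each resulting document. Stack the vectors into a matrix and apply a pairwise distance function with cosine as the metric. NumPy itself does not ship one; scipy.spatial.distance.cdist and sklearn.metrics.pairwise_distances both accept metric='cosine'. These return cosine distance, so the similarity is 1 minus the distance.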
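
The steps above can be sketched as follows. This is a minimal illustration, not the only way to do it: it computes cosine similarity directly with NumPy instead of calling a pairwise distance function, and the helper names cosine_similarities and similarity_scores are made up for this example. It assumes nlp is an already loaded spaCy pipeline whose documents expose a meaningful .vector (typically a model that ships word vectors), and that texts is non-empty.

```python
import numpy as np


def cosine_similarities(query_vec, doc_matrix):
    """Cosine similarity of one vector against each row of a 2-D matrix."""
    query_vec = np.asarray(query_vec, dtype=float)
    doc_matrix = np.asarray(doc_matrix, dtype=float)
    # Product of the query norm and each row norm.
    denom = np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vec)
    scores = np.zeros(len(doc_matrix))
    # One matrix-vector product replaces the Python loop.
    # Rows with a zero norm (no embedding) keep a score of 0.0.
    np.divide(doc_matrix @ query_vec, denom, out=scores, where=denom > 0)
    return scores


def similarity_scores(nlp, one_text, texts):
    """Hypothetical helper: score one text against many with a spaCy pipeline."""
    query = nlp(one_text)
    docs = list(nlp.pipe(texts))  # batch-process the list of strings
    doc_matrix = np.array([doc.vector for doc in docs])
    return cosine_similarities(query.vector, doc_matrix)
```

With a pipeline loaded, similarity_scores(nlp, oneDoc, listDocs) on the raw strings from the question should return one score per entry in listDocs. The values should be close to what the loop over oneDoc.similarity(doc) gives, since that method is also a cosine similarity over document vectors, though small floating-point differences are possible. Building the Doc objects with nlp.pipe is still the dominant cost; only the scoring step is vectorized.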