I have to compare one spaCy document to a list of spaCy documents and want a list of similarity scores as output. Of course, I can do this with a for loop, but I'm looking for an optimized solution, something like NumPy broadcasting.
I have one document against a list of documents:
oneDoc = 'Hello, I want to be compared with a list of documents'
listDocs = ["I'm the first one", "I'm the second one"]
spaCy offers us a document similarity function:
oneDoc = nlp(oneDoc)
listDocs = [nlp(text) for text in listDocs]  # nlp() takes a single string, not a list
similarity_score = np.zeros(len(listDocs))
for i, doc in enumerate(listDocs):
    similarity_score[i] = oneDoc.similarity(doc)
Since one document is compared with a list of two documents, the output would contain two scores, for example:
[0.7, 0.8]
I'm looking for a way to avoid this for loop. In other words, I want to vectorize this function.
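Use nlp.pipe to process all of your text documents. Grab the embedding (.vector) from each document. Then apply a pairwise distance function with cosine as the metric to build the matrix, for example scipy.spatial.distance.cdist or sklearn.metrics.pairwise_distances (NumPy itself does not ship one). These functions return cosine distance, so the similarity is 1 minus the result.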
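The steps above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the helper names cosine_similarities and spacy_similarities are made up for this example, and it assumes a spaCy pipeline that ships word vectors (for instance en_core_web_md), since Doc.similarity by default computes the cosine of the averaged document vectors. The cosine is computed directly with NumPy here instead of going through a distance function.

```python
import numpy as np


def cosine_similarities(query_vec, doc_matrix):
    """Cosine similarity between one vector and every row of a 2-D matrix."""
    query_vec = np.asarray(query_vec, dtype=float)
    doc_matrix = np.asarray(doc_matrix, dtype=float)
    # One matrix-vector product yields all dot products at once.
    dots = doc_matrix @ query_vec
    norms = np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vec)
    # Rows with a zero norm (no vector available) get a similarity of 0.
    return np.divide(dots, norms, out=np.zeros_like(dots), where=norms > 0)


def spacy_similarities(nlp, one_text, list_texts):
    """Hypothetical helper: compare one text against many without a Python-level
    similarity loop. `nlp` is assumed to be a loaded spaCy pipeline with vectors."""
    one_doc = nlp(one_text)
    docs = list(nlp.pipe(list_texts))  # batch-process the list of texts
    doc_matrix = np.stack([doc.vector for doc in docs])
    return cosine_similarities(one_doc.vector, doc_matrix)
```

Usage would look something like `nlp = spacy.load("en_core_web_md")` followed by `scores = spacy_similarities(nlp, oneDoc, listDocs)`, where oneDoc and listDocs are still the raw strings. If SciPy is available, `1 - cdist(query_vec[None, :], doc_matrix, metric="cosine")` should give the same numbers as a 1 x N matrix.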