After preprocessing and transforming the data (BOW, TF-IDF), I need to calculate the cosine similarity of each element of the dataset with every other element. Currently, I do this:
cs_title = [cosine_similarity(a, b) for a in tr_title for b in tr_title]
cs_abstract = [cosine_similarity(a, b) for a in tr_abstract for b in tr_abstract]
cs_mesh = [cosine_similarity(a, b) for a in pre_mesh for b in pre_mesh]
cs_pt = [cosine_similarity(a, b) for a in pre_pt for b in pre_pt]
In this example, each input variable, e.g. tr_title, is a SciPy sparse matrix. However, this code runs extremely slowly. What can I do to optimise the code so it runs more quickly?
To improve performance, you should replace the list comprehensions with vectorized code. This can be implemented easily with SciPy's pdist and squareform, as shown in the snippet below:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from scipy.spatial.distance import pdist, squareform
titles = [
'A New Hope',
'The Empire Strikes Back',
'Return of the Jedi',
'The Phantom Menace',
'Attack of the Clones',
'Revenge of the Sith',
'The Force Awakens',
'A Star Wars Story',
'The Last Jedi',
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(titles)
cs_title = squareform(pdist(X.toarray(), 'cosine'))
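One caveat: pdist computes the cosine distance rather than the similarity, which is why the diagonal of cs_title in the demo below is zero. If you want similarities instead, subtract the matrix from one, as in this minimal sketch (cs_title_sim is just an illustrative name):

# pdist returns cosine *distances*; similarity = 1 - distance
cs_title_sim = 1 - cs_title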
Demo:
In [87]: X
Out[87]:
<9x21 sparse matrix of type '<type 'numpy.int64'>'
with 30 stored elements in Compressed Sparse Row format>
In [88]: X.toarray()
Out[88]:
array([[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0],
[1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0],
[0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]], dtype=int64)
In [89]: vectorizer.get_feature_names()
Out[89]:
[u'attack',
u'awakens',
u'back',
u'clones',
u'empire',
u'force',
u'hope',
u'jedi',
u'last',
u'menace',
u'new',
u'of',
u'phantom',
u'return',
u'revenge',
u'sith',
u'star',
u'story',
u'strikes',
u'the',
u'wars']
In [90]: np.set_printoptions(precision=2)
In [91]: print(cs_title)
[[ 0. 1. 1. 1. 1. 1. 1. 1. 1. ]
[ 1. 0. 0.75 0.71 0.75 0.75 0.71 1. 0.71]
[ 1. 0.75 0. 0.71 0.5 0.5 0.71 1. 0.42]
[ 1. 0.71 0.71 0. 0.71 0.71 0.67 1. 0.67]
[ 1. 0.75 0.5 0.71 0. 0.5 0.71 1. 0.71]
[ 1. 0.75 0.5 0.71 0.5 0. 0.71 1. 0.71]
[ 1. 0.71 0.71 0.67 0.71 0.71 0. 1. 0.67]
[ 1. 1. 1. 1. 1. 1. 1. 0. 1. ]
[ 1. 0.71 0.42 0.67 0.71 0.71 0.67 1. 0. ]]
Notice that X.toarray().shape yields (9L, 21L) because in the toy example above there are 9 titles and 21 different words, whereas cs_title is a 9 by 9 array.
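Since the inputs in the question are already SciPy sparse matrices, another option worth benchmarking is scikit-learn's pairwise cosine_similarity, which accepts sparse input directly and avoids the dense toarray() conversion. A minimal sketch, assuming tr_title is the sparse matrix from the question:

from sklearn.metrics.pairwise import cosine_similarity

# One vectorized call: returns the full n x n pairwise similarity
# matrix without densifying the sparse term-document matrix first
cs_title = cosine_similarity(tr_title)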