I want to use TfidfVectorizer() on a file that contains many lines, each a phrase. I then want to take a test file with a small subset of phrases, vectorize it with TfidfVectorizer() as well, and compute the cosine similarity between the original and the test file so that, for a given phrase in the test file, I retrieve the top N matches within the original file. Here is my attempt:
corpus = tuple(open("original.txt").read().split('\n'))
test = tuple(open("test.txt").read().split('\n'))
from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer(analyzer='word', ngram_range=(1,3), min_df = 0, stop_words = 'english')
tfidf_matrix = tf.fit_transform(corpus)
tfidf_matrix2 = tf.fit_transform(test)
from sklearn.metrics.pairwise import linear_kernel
def new_find_similar(tfidf_matrix2, index, tfidf_matrix, top_n = 5):
    cosine_similarities = linear_kernel(tfidf_matrix2[index:index+1], tfidf_matrix).flatten()
    related_docs_indices = [i for i in cosine_similarities.argsort()[::-1] if i != index]
    return [(index, cosine_similarities[index]) for index in related_docs_indices][0:top_n]
for index, score in find_similar(tfidf_matrix, 1234567):
    print score, corpus[index]
However I get:
for index, score in new_find_similar(tfidf_matrix2, 1000, tfidf_matrix):
    print score, test[index]
Traceback (most recent call last):
  File "<ipython-input-53-2bf1cd465991>", line 1, in <module>
    for index, score in new_find_similar(tfidf_matrix2, 1000, tfidf_matrix):
  File "<ipython-input-51-da874b8d3076>", line 2, in new_find_similar
    cosine_similarities = linear_kernel(tfidf_matrix2[index:index+1], tfidf_matrix).flatten()
  File "C:\Users\arron\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\metrics\pairwise.py", line 734, in linear_kernel
    X, Y = check_pairwise_arrays(X, Y)
  File "C:\Users\arron\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\metrics\pairwise.py", line 122, in check_pairwise_arrays
    X.shape[1], Y.shape[1]))
ValueError: Incompatible dimension for X and Y matrices: X.shape[1] == 66662 while Y.shape[1] == 3332088
I wouldn't mind combining both files and then transforming, but I want to be sure I do not compare any of the phrases from the test file with any of the other phrases within the test file.
Any pointers?
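Fit the TfidfVectorizer with data from corpus, then transform the test data with the already fitted vectorizer (i.e., do not call fit_transform twice). The second fit_transform call discards the vocabulary learned from corpus and learns a new one from test, which is why the two matrices end up with different numbers of columns: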
tfidf_matrix = tf.fit_transform(corpus)
tfidf_matrix2 = tf.transform(test)
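To see the whole flow end to end, here is a minimal sketch under a few assumptions: the two small lists stand in for the lines of original.txt and test.txt, and find_similar_in_corpus is a hypothetical helper name, not something from the question. It drops the `i != index` filter, since a test row index and a corpus row index refer to different phrases, and it leaves out min_df, which defaults to 1.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical stand-ins for the lines of original.txt and test.txt.
corpus = [
    "red apple pie recipe",
    "green apple juice",
    "fast red sports car",
    "slow cooked apple pie",
]
test = ["apple pie", "red car"]

tf = TfidfVectorizer(analyzer="word", ngram_range=(1, 3), stop_words="english")
tfidf_matrix = tf.fit_transform(corpus)  # learns the vocabulary and IDF weights
tfidf_matrix2 = tf.transform(test)       # reuses them, so the columns line up


def find_similar_in_corpus(test_matrix, index, corpus_matrix, top_n=5):
    """Return (corpus_index, score) pairs for one test row, best match first."""
    # TF-IDF rows are L2-normalized by default, so the dot product
    # computed by linear_kernel equals the cosine similarity.
    scores = linear_kernel(test_matrix[index:index + 1], corpus_matrix).flatten()
    best = scores.argsort()[::-1][:top_n]
    return [(int(i), float(scores[i])) for i in best]


for i, phrase in enumerate(test):
    matches = find_similar_in_corpus(tfidf_matrix2, i, tfidf_matrix, top_n=2)
    for corpus_index, score in matches:
        print("%s -> %.3f %s" % (phrase, score, corpus[corpus_index]))
```

The returned indices point into corpus, so the matched phrase is looked up with corpus[corpus_index] rather than test[index]. Because each test row is only ever multiplied against tfidf_matrix, test phrases are never compared with one another, and there is no need to combine the two files.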