python scikit-learn tf-idf cosine-similarity

Python: compare items within two different tfidf matrices of different dimensions

I want to fit TfidfVectorizer() on a file that contains many lines, each a phrase. I then want to take a test file with a small subset of phrases, vectorize it the same way, and compute the cosine similarity between the two so that, for a given phrase in the test file, I retrieve the top N matches from the original file. Here is my attempt:

with open("original.txt") as f:
    corpus = tuple(f.read().splitlines())
with open("test.txt") as f:
    test = tuple(f.read().splitlines())


from sklearn.feature_extraction.text import TfidfVectorizer

tf = TfidfVectorizer(analyzer='word', ngram_range=(1,3), min_df = 0, stop_words = 'english')
tfidf_matrix =  tf.fit_transform(corpus)
tfidf_matrix2 =  tf.fit_transform(test)

from sklearn.metrics.pairwise import linear_kernel 


def new_find_similar(tfidf_matrix2, index, tfidf_matrix, top_n = 5):
    cosine_similarities = linear_kernel(tfidf_matrix2[index:index+1], tfidf_matrix).flatten()
    # Rows of tfidf_matrix2 and tfidf_matrix are different documents, so no index is excluded.
    related_docs_indices = cosine_similarities.argsort()[::-1][:top_n]
    return [(i, cosine_similarities[i]) for i in related_docs_indices]


for index, score in new_find_similar(tfidf_matrix2, 1000, tfidf_matrix):
    print score, corpus[index]

However I get:

for index, score in new_find_similar(tfidf_matrix2, 1000, tfidf_matrix):
       print score, test[index]
Traceback (most recent call last):

  File "<ipython-input-53-2bf1cd465991>", line 1, in <module>
    for index, score in new_find_similar(tfidf_matrix2, 1000, tfidf_matrix):

  File "<ipython-input-51-da874b8d3076>", line 2, in new_find_similar
    cosine_similarities = linear_kernel(tfidf_matrix2[index:index+1], tfidf_matrix).flatten()

  File "C:\Users\arron\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\metrics\pairwise.py", line 734, in linear_kernel
    X, Y = check_pairwise_arrays(X, Y)

  File "C:\Users\arron\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\metrics\pairwise.py", line 122, in check_pairwise_arrays
    X.shape[1], Y.shape[1]))

ValueError: Incompatible dimension for X and Y matrices: X.shape[1] == 66662 while Y.shape[1] == 3332088

I wouldn't mind combining both files and then transforming, but I want to be sure I do not compare any of the phrases from the test file with any of the other phrases within the test file.

Any pointers?

Solution

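Fit the TfidfVectorizer on the corpus, then transform the test data with the already fitted vectorizer (i.e., do not call fit_transform twice). The second fit_transform call learns a new vocabulary from the test file alone, which is why the two matrices end up with different numbers of columns (66662 vs. 3332088):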

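tfidf_matrix = tf.fit_transform(corpus)  # learn vocabulary and IDF weights from the corpus
tfidf_matrix2 = tf.transform(test)       # map the test phrases onto that same vocabulary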
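
For completeness, here is a minimal end-to-end sketch of the fit-then-transform approach with a top-N lookup. The two small lists are made-up stand-ins for original.txt and test.txt, the helper name top_matches is hypothetical, and the snippet uses Python 3 print syntax rather than the Python 2 syntax in the question. Because every test row is scored only against corpus rows, test phrases are never compared with each other.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Toy stand-ins for original.txt and test.txt (hypothetical data).
corpus = [
    "red apple pie recipe",
    "green apple tart",
    "fast sports car",
    "slow cargo train",
]
test = ["apple pie", "sports car race"]

tf = TfidfVectorizer(analyzer="word", ngram_range=(1, 3), stop_words="english")
corpus_matrix = tf.fit_transform(corpus)  # learns the vocabulary and IDF weights
test_matrix = tf.transform(test)          # reuses them, so column counts match


def top_matches(test_matrix, test_index, corpus_matrix, top_n=5):
    """Return (corpus_index, score) pairs for one test phrase, best first."""
    # TfidfVectorizer L2-normalises rows by default, so the dot product
    # computed by linear_kernel equals the cosine similarity.
    row = test_matrix[test_index:test_index + 1]
    scores = linear_kernel(row, corpus_matrix).flatten()
    best = scores.argsort()[::-1][:top_n]
    return [(int(i), float(scores[i])) for i in best]


for i, score in top_matches(test_matrix, 0, corpus_matrix, top_n=2):
    print(score, corpus[i])
```

Note that the returned indices refer to rows of the corpus, so the matching text is looked up in corpus, not in test. Terms that occur only in the test file are ignored by transform, since they are not part of the fitted vocabulary.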