Tags: python, scikit-learn, cosine-similarity, tfidfvectorizer

Re-calculate similarity matrix given new documents


I'm running an experiment on a set of text documents, and I need the (cosine) similarity matrix between all of them as input to another calculation. For that I use sklearn's TfidfVectorizer:

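from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [doc1, doc2, doc3, doc4]
vect = TfidfVectorizer(min_df=1, stop_words="english", use_idf=False)
tfidf = vect.fit_transform(corpus)
similarities = tfidf * tfidf.T  # rows are L2-normalized, so this is the cosine similarity
pairwise_similarity_matrix = similarities.toarray()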

The problem is that with each iteration of my experiment I discover new documents that I need to add to my similarity matrix. Given the number of documents I'm working with (tens of thousands and more), recomputing everything is very time-consuming.

I want to find a way to calculate only the similarities between the new batch of documents and the existing ones, without computing it all again on the entire data set.

Note that I'm using a term-frequency (tf) representation, without using inverse-document-frequency (idf), so in theory I don't need to re-calculate the whole matrix each time.


Solution

  • OK, I got it. The idea is, as I said, to calculate the similarity only between the new batch of files and the existing ones, since the similarities among the existing documents are unchanged. The difficulty is keeping the TfidfVectorizer's vocabulary updated with the newly seen terms.

    The solution has 2 steps:

    1. Update the vocabulary and the tf matrices.
    2. Matrix multiplications and stacking.
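Step 2 rests on a block decomposition of the similarity matrix. As a sketch, write A for the existing tf matrix (one L2-normalized row per old document, padded with zero columns for the newly seen terms) and B for the tf matrix of the new batch over the same merged vocabulary. The symbols A and B are introduced here for illustration only and do not appear in the script.

```latex
\begin{pmatrix} A \\ B \end{pmatrix}
\begin{pmatrix} A \\ B \end{pmatrix}^{\top}
=
\begin{pmatrix}
A A^{\top} & A B^{\top} \\
B A^{\top} & B B^{\top}
\end{pmatrix}
```

The block A Aᵀ is the similarity matrix that already exists, and zero-padding does not change it. Only B Aᵀ (new versus old) and B Bᵀ (new versus new) have to be computed, and A Bᵀ is the transpose of B Aᵀ.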

    Here's the whole script. We start with the original corpus and the fitted vectorizer and matrices computed from it:

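    import numpy as np
    from scipy.sparse import csr_matrix, hstack, vstack
    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = [doc1, doc2, doc3]
    # Build for the first time:
    vect = TfidfVectorizer(min_df=1, stop_words="english", use_idf=False)
    tf_matrix = vect.fit_transform(corpus)
    similarities = tf_matrix * tf_matrix.T
    similarities_matrix = similarities.toarray() # just for printing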
    

    Now, given new documents:

    new_docs_corpus = [docx, docy, docz] # New documents
    # Building new vectorizer to create the parsed vocabulary of the new documents:
    new_vect = TfidfVectorizer(min_df=1, stop_words="english", use_idf=False) 
    new_vect.fit(new_docs_corpus)
    
    # Merging old and new vocabs:
    new_terms_count = 0
    for term in new_vect.vocabulary_:
        if term in vect.vocabulary_:
            continue
        vect.vocabulary_[term] = np.int64(len(vect.vocabulary_)) # important not to assign a simple int
        new_terms_count += 1
    new_vect.vocabulary_ = vect.vocabulary_
    
    # Build the new docs' representation using the merged vocabulary:
    new_tf_matrix = new_vect.transform(new_docs_corpus)
    new_similarities = new_tf_matrix * new_tf_matrix.T
    
    # Pad the old tf-matrix so it has the same number of columns:
    if new_terms_count:
        zero_matrix = csr_matrix((tf_matrix.shape[0], new_terms_count))
        tf_matrix = hstack([tf_matrix, zero_matrix], format="csr")
    # tf_matrix = vect.transform(corpus) # Instead, we just append 0's for the new terms and stack the tf_matrix over the new one, to save time
    cross_similarities = new_tf_matrix * tf_matrix.T # Calculate cross-similarities
    tf_matrix = vstack([tf_matrix, new_tf_matrix], format="csr")
    # Stack it all together:
    similarities = vstack([hstack([similarities, cross_similarities.T]), hstack([cross_similarities, new_similarities])])
    similarities_matrix = similarities.toarray()
    
    # Updating the corpus with the new documents:
    corpus = corpus + new_docs_corpus
    

    We can check this by comparing the resulting similarities_matrix with the one obtained by fitting a fresh TfidfVectorizer on the joint corpus (the old documents plus the new ones) and multiplying as before.
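That check could be sketched as follows. The helper names `tf_vectorizer` and `add_documents` and the sample sentences are hypothetical, made up for this illustration. Note one deliberate difference from the script above: instead of editing `vocabulary_` on an already fitted vectorizer, this sketch passes the merged vocabulary through the `vocabulary` constructor argument, on the assumption that this is less sensitive to the scikit-learn version in use.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack, vstack
from sklearn.feature_extraction.text import TfidfVectorizer


def tf_vectorizer(vocabulary=None):
    # Same settings as in the answer: L2-normalized term frequencies, no idf.
    return TfidfVectorizer(min_df=1, stop_words="english", use_idf=False,
                           vocabulary=vocabulary)


def add_documents(vocab, tf_matrix, similarities, new_docs):
    """Hypothetical helper: extend vocabulary, tf matrix and similarity matrix."""
    # Learn the terms of the new batch and append the unseen ones to a copy
    # of the old vocabulary, so existing column indices stay where they are.
    merged = dict(vocab)
    for term in sorted(tf_vectorizer().fit(new_docs).vocabulary_):
        if term not in merged:
            merged[term] = len(merged)
    n_new_terms = len(merged) - len(vocab)

    # Vectorize the new batch against the merged, fixed vocabulary.
    new_tf = tf_vectorizer(merged).fit_transform(new_docs)

    # Pad the old rows with zero columns for the newly seen terms.
    if n_new_terms:
        padding = csr_matrix((tf_matrix.shape[0], n_new_terms))
        tf_matrix = hstack([tf_matrix, padding], format="csr")

    cross = new_tf @ tf_matrix.T   # new docs vs. old docs
    new_block = new_tf @ new_tf.T  # new docs vs. new docs
    similarities = vstack([hstack([similarities, cross.T]),
                           hstack([cross, new_block])], format="csr")
    tf_matrix = vstack([tf_matrix, new_tf], format="csr")
    return merged, tf_matrix, similarities


# Made-up sample data, for illustration only.
corpus = ["the cat sat on the mat", "dogs chase cats", "a bird sings at dawn"]
new_docs = ["the cat chased a bird", "fresh bread smells good"]

# Initial build, then one incremental update.
vect = tf_vectorizer()
tf_matrix = vect.fit_transform(corpus)
similarities = tf_matrix @ tf_matrix.T
vocab, tf_matrix, similarities = add_documents(
    vect.vocabulary_, tf_matrix, similarities, new_docs)
incremental = similarities.toarray()

# Reference: refit from scratch on the joint corpus.
full_tf = tf_vectorizer().fit_transform(corpus + new_docs)
full = (full_tf @ full_tf.T).toarray()

print(np.allclose(incremental, full))
```

The column order of the merged vocabulary differs from that of a fresh fit, but cosine similarity does not depend on column order, so the two matrices are compared directly.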

    As discussed in the comments, this works only because we are not using the idf (inverse-document-frequency) component, which would change the representation of existing documents whenever new ones are added.
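A small illustration of that last point, assuming scikit-learn's default idf settings and using made-up documents: the similarity between two existing documents stays the same under plain tf when the corpus grows, but shifts once idf is switched on. The function name `similarity_of_first_two` is hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up sample data, for illustration only.
old_docs = ["apple banana", "apple cherry"]
all_docs = old_docs + ["banana cherry banana", "banana date"]


def similarity_of_first_two(docs, use_idf):
    # Cosine similarity between the first two documents of `docs`;
    # rows are L2-normalized, so the dot product is the cosine.
    x = TfidfVectorizer(use_idf=use_idf).fit_transform(docs)
    return (x[0] @ x[1].T).toarray()[0, 0]


tf_before = similarity_of_first_two(old_docs, use_idf=False)
tf_after = similarity_of_first_two(all_docs, use_idf=False)
idf_before = similarity_of_first_two(old_docs, use_idf=True)
idf_after = similarity_of_first_two(all_docs, use_idf=True)

print("tf unchanged: ", np.isclose(tf_before, tf_after))
print("idf unchanged:", np.isclose(idf_before, idf_after))
```

With idf enabled, every new batch changes the document frequencies and therefore the weights of the old rows, so the stored block of old similarities could no longer be reused as is.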