python, scikit-learn, tf-idf, cosine-similarity

Using sklearn, how do I calculate the tf-idf cosine similarity between documents and a query?


My goal is to input 3 queries and find out which query is most similar to a set of 5 documents.

So far I have calculated the tf-idf of the documents as follows:

from sklearn.feature_extraction.text import TfidfVectorizer

def get_term_frequency_inverse_data_frequency(documents):
    allDocs = []
    for document in documents:
        allDocs.append(nlp.clean_tf_idf_text(document))
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(allDocs)
    return matrix

def get_tf_idf_query_similarity(documents, query):
    tfidf = get_term_frequency_inverse_data_frequency(documents)

The problem I am having is this: now that I have the tf-idf of the documents, what operations do I perform on the query so that I can find its cosine similarity to the documents?


Solution

  • Here is my suggestion:

    • We don't have to fit the model twice. We can reuse the same fitted vectorizer to transform the query.
    • The text cleaning function can be plugged into TfidfVectorizer directly using the preprocessor parameter.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    
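    # allDocs is the list of raw document strings; the preprocessor cleans each one.
    # A custom preprocessor replaces the built-in lowercasing and accent stripping,
    # so nlp.clean_tf_idf_text must lowercase the text itself if that is wanted.
    vectorizer = TfidfVectorizer(preprocessor=nlp.clean_tf_idf_text)
    docs_tfidf = vectorizer.fit_transform(allDocs)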
    
    def get_tf_idf_query_similarity(vectorizer, docs_tfidf, query):
        """
        vectorizer: fitted TfidfVectorizer
        docs_tfidf: tf-idf matrix for all docs, shape (n_docs, n_terms)
        query: query string
    
        return: 1-D array of cosine similarities between the query and each doc
        """
        query_tfidf = vectorizer.transform([query])
        cosineSimilarities = cosine_similarity(query_tfidf, docs_tfidf).flatten()
        return cosineSimilarities
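
    To tie this back to the original goal (3 queries against 5 documents), here is a minimal end-to-end sketch. The documents and queries are made-up toy data, the custom preprocessor is left out so the block runs on its own, and "most similar to the set" is assumed to mean the highest mean similarity across all documents; a different aggregate such as the maximum may suit your case better.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy data, made up for illustration only.
documents = [
    "the cat sat on the mat",
    "dogs and cats are common pets",
    "the stock market fell sharply today",
    "investors worry about the market",
    "a cat chased the dog around the garden",
]
queries = [
    "cat and dog",
    "stock market investors",
    "quantum physics lecture",
]

# Fit once on the documents, then reuse the fitted vocabulary and idf weights
# to transform the queries. Query terms not seen during fitting are ignored.
vectorizer = TfidfVectorizer()
docs_tfidf = vectorizer.fit_transform(documents)
queries_tfidf = vectorizer.transform(queries)

# Shape (n_queries, n_documents): one row of similarities per query.
similarities = cosine_similarity(queries_tfidf, docs_tfidf)

# Assumption: rank queries by their mean similarity over all documents.
mean_scores = similarities.mean(axis=1)
best_query = queries[int(mean_scores.argmax())]

for query, score in zip(queries, mean_scores):
    print(f"{query!r}: mean similarity {score:.3f}")
print("Most similar query:", best_query)
```

    Passing all queries to transform at once gives the full query-by-document matrix in a single call, so there is no need to loop over get_tf_idf_query_similarity. A query that shares no terms with the fitted vocabulary, like the third one here, gets a similarity of 0 to every document.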