Tags: python, scikit-learn, text-processing, tf-idf, tfidfvectorizer

Calculating TF-IDF Score of a Single String


I do string matching using TF-IDF and cosine similarity, and it works well for finding the similarity between strings in a list of strings.

Now I want to match a new string against the previously calculated matrix. I calculate the TF-IDF scores using the code below.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams)
tf_idf_matrix = vectorizer.fit_transform(list_string)

How can I calculate the TF-IDF score of a new string against the previous matrix? I could add the new string to the series and recalculate the matrix as shown below, but that is inefficient: I only want the last row of the matrix and don't need the rows for the old series to be recalculated.

list_string.append(new_string)  # assuming a Python list: append() works in place and returns None

single_matrix = vectorizer.fit_transform(list_string)

single_matrix = single_matrix[len(list_string) - 1:]

After reading about TF-IDF calculation for a while, I am thinking about saving the IDF value of each term and manually calculating the TF-IDF of the new string without using the matrix, but I don't know how to do that. How can I do this? Or is there a better way?


Solution

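  • Refitting the vectorizer just to score a single entry is unnecessary. Instead, call the .transform() method of the already fitted vectorizer on the new string only (not on the whole list). Wrap the string in a list, because .transform() expects an iterable of documents and raises an error when given a bare string:

    single_entry = vectorizer.transform([new_string])

    See the scikit-learn docs for TfidfVectorizer.transform.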
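
    To make this concrete, here is a minimal sketch of the whole flow: fit once, transform the new string, and compare it with the stored matrix. The `char_ngrams` helper is a hypothetical stand-in for the question's `ngrams` analyzer (which is not shown), and the sample strings are made up. The last part reproduces the same row by hand from the saved `idf_` values, assuming the vectorizer's default settings (smoothed IDF, L2 normalization, no sublinear TF).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def char_ngrams(text, n=3):
    """Hypothetical stand-in for the question's `ngrams` analyzer."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Made-up sample data for illustration only.
list_string = ["apple inc", "apple incorporated", "banana corp"]
new_string = "apple inc."

# Fit once: this learns the vocabulary and the IDF weights.
vectorizer = TfidfVectorizer(min_df=1, analyzer=char_ngrams)
tf_idf_matrix = vectorizer.fit_transform(list_string)

# Reuse the fitted vectorizer; transform() takes an iterable of documents.
single_entry = vectorizer.transform([new_string])

# Cosine similarity of the new row against every stored row.
scores = cosine_similarity(single_entry, tf_idf_matrix).ravel()
best = int(np.argmax(scores))
print(list_string[best], scores[best])

# Manual route: raw term counts times the saved IDF, then L2-normalize.
# N-grams that are not in the fitted vocabulary are ignored.
counts = np.zeros(len(vectorizer.vocabulary_))
for gram in char_ngrams(new_string):
    idx = vectorizer.vocabulary_.get(gram)
    if idx is not None:
        counts[idx] += 1
manual = counts * vectorizer.idf_
manual /= np.linalg.norm(manual)
```

    Note that the transformed row is generally not identical to the last row obtained by refitting on the extended list, because refitting also changes the vocabulary and the IDF weights. The manual version is shown only to illustrate what `.transform()` computes; in practice calling `.transform()` is simpler.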