I do string matching using TF-IDF and cosine similarity, and it works well for finding the similarity between strings in a list of strings.
Now I want to match a new string against the previously calculated matrix. I calculate the TF-IDF scores using the code below.
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams)
tf_idf_matrix = vectorizer.fit_transform(list_string)
How can I calculate the TF-IDF score of a new string against the previous matrix? I could add the new string to the series and recalculate the matrix as below, but that is inefficient: I only need the last row of the matrix, and the rows for the old series do not need to be recalculated.
# append() on a plain list returns None, so build a new list instead
list_string = list(list_string) + [new_string]
single_matrix = vectorizer.fit_transform(list_string)
single_matrix = single_matrix[len(list_string) - 1:]
After reading about TF-IDF calculation for a while, I am thinking about saving the IDF value of each term and manually calculating the TF-IDF of the new string without using the matrix, but I don't know how to do that. How can I do this? Or is there a better way?
Refitting the TF-IDF vectorizer just to score a single new entry is unnecessary. Instead, call the .transform() method of the already fitted vectorizer on the new string alone (not on the whole list). Note that .transform() expects an iterable of documents, so wrap the string in a list:
single_entry = vectorizer.transform([new_string])
See the scikit-learn documentation for TfidfVectorizer.transform for details.
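To connect this back to the matching step in the question, here is a minimal sketch of scoring one new string against the already fitted matrix. The `ngrams` function and the sample strings are hypothetical stand-ins (the question does not show its analyzer or data), and using `cosine_similarity` from `sklearn.metrics.pairwise` is just one way the comparison could be done.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def ngrams(text, n=3):
    """Hypothetical stand-in for the question's analyzer: character n-grams."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Made-up sample data, for illustration only.
list_string = ["apple inc", "apple incorporated", "banana corp"]
new_string = "apple inc."

# Fit once on the existing strings; the vocabulary and IDF weights are stored
# on the vectorizer.
vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams)
tf_idf_matrix = vectorizer.fit_transform(list_string)

# Reuse the fitted vocabulary and IDF weights for the new string only.
# transform() expects an iterable of documents, hence the list.
single_entry = vectorizer.transform([new_string])

# One similarity score per previously fitted string.
similarities = cosine_similarity(single_entry, tf_idf_matrix).ravel()
best_index = int(similarities.argmax())
print(list_string[best_index], similarities[best_index])
```

One consequence of this approach: n-grams in the new string that were not seen during fitting are ignored, and the IDF weights are not updated, so the result can differ slightly from what a full refit would give.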