Tags: python, machine-learning, scikit-learn, sparse-matrix, fuzzy-logic

How fit_transform, transform and TfidfVectorizer work


I'm working on a fuzzy matching project and I have found a very interesting method: awesome_cossim_top

I broadly understand what it does, but I don't understand what is happening when we call fit_transform.

import pandas as pd
import sqlite3 as sql
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from scipy.sparse import csr_matrix
import sparse_dot_topn.sparse_dot_topn as ct
import re

def ngrams(string, n=3):
    string = re.sub(r'[,-./]|\sBD',r'', re.sub(' +', ' ',str(string)))
    ngrams = zip(*[string[i:] for i in range(n)])
    return [''.join(ngram) for ngram in ngrams]
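
# Illustrative example (not from the original post):
#   ngrams('apple') -> ['app', 'ppl', 'ple']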

def awesome_cossim_top(A, B, ntop, lower_bound=0):
    # force A and B as a CSR matrix.
    # If they have already been CSR, there is no overhead
    A = A.tocsr()
    B = B.tocsr()
    M, _ = A.shape
    _, N = B.shape

    idx_dtype = np.int32

    nnz_max = M*ntop

    indptr = np.zeros(M+1, dtype=idx_dtype)
    indices = np.zeros(nnz_max, dtype=idx_dtype)
    data = np.zeros(nnz_max, dtype=A.dtype)

    ct.sparse_dot_topn(
            M, N, np.asarray(A.indptr, dtype=idx_dtype),
            np.asarray(A.indices, dtype=idx_dtype),
            A.data,
            np.asarray(B.indptr, dtype=idx_dtype),
            np.asarray(B.indices, dtype=idx_dtype),
            B.data,
            ntop,
            lower_bound,
            indptr, indices, data)

    print('ct.sparse_dot_topn: ', ct.sparse_dot_topn)
    return csr_matrix((data,indices,indptr),shape=(M,N))
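
# Note (added for clarity): the returned M x N CSR matrix keeps, for each row of A,
# only the ntop largest dot products with the columns of B that exceed lower_bound.
# Since TfidfVectorizer L2-normalises its rows by default, these dot products are
# cosine similarities.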

def get_matches_df(sparse_matrix, A, B, top=100):
    # Pair each non-zero similarity with the corresponding source (A) and target (B) strings.
    non_zeros = sparse_matrix.nonzero()

    sparserows = non_zeros[0]
    sparsecols = non_zeros[1]

    if top:
        nr_matches = top
    else:
        nr_matches = sparsecols.size

    left_side = np.empty([nr_matches], dtype=object)
    right_side = np.empty([nr_matches], dtype=object)
    similarity = np.zeros(nr_matches)

    for index in range(0, nr_matches):
        left_side[index] = A[sparserows[index]]
        right_side[index] = B[sparsecols[index]]
        similarity[index] = sparse_matrix.data[index]

    return pd.DataFrame({'left_side': left_side,
                         'right_side': right_side,
                         'similarity': similarity})

Here is the script where I get confused: why should we first use fit_transform and then transform with the SAME vectorizer? I tried printing some output from the vectorizer and the matrices, e.g. print(vectorizer.get_feature_names()), but I still don't understand the logic.

Can anyone help me clarify this?

Thanks a lot!

Col_clean = 'fruits_normalized'
Col_dirty = 'fruits'

#read table
data_dirty={f'{Col_dirty}':['I am an apple', 'You are an apple', 'Aple', 'Appls', 'Apples']}
data_clean= {f'{Col_clean}':['apple', 'pear', 'banana', 'apricot', 'pineapple']}

df_clean = pd.DataFrame(data_clean)
df_dirty = pd.DataFrame(data_dirty)

Name_clean = df_clean[f'{Col_clean}'].unique()
Name_dirty= df_dirty[f'{Col_dirty}'].unique()

vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams)
clean_idf_matrix = vectorizer.fit_transform(Name_clean)
dirty_idf_matrix = vectorizer.transform(Name_dirty)

matches = awesome_cossim_top(dirty_idf_matrix, clean_idf_matrix.transpose(),1,0)
matches_df = get_matches_df(matches, Name_dirty, Name_clean, top = 0)

with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    matches_df.to_excel("output_apple.xlsx")

print('done')

Solution

  • TfidfVectorizer.fit_transform is used to learn the vocabulary from the training dataset, and TfidfVectorizer.transform is used to map the test dataset onto that same vocabulary, so that the number of features in the test data stays the same as in the train data. The example below might help:

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    

    Create some dummy training data:

    train = pd.DataFrame({'Text' :['I am a data scientist','Cricket is my favorite sport', 'I work on Python regularly', 'Python is very fast for data mining', 'I love playing cricket'],
                          'Category' :['Data_Science','Cricket','Data_Science','Data_Science','Cricket']})
    

    And a small test dataset:

    test = pd.DataFrame({'Text' :['I am new to data science field', 'I play cricket on weekends', 'I like writing Python codes'],
                             'Category' :['Data_Science','Cricket','Data_Science']})
    

    Create a TfidfVectorizer() object called vectorizer

    vectorizer = TfidfVectorizer()
    

    Fit it on the train data

    X_train = vectorizer.fit_transform(train['Text'])
    print(vectorizer.get_feature_names())
    
    #['am', 'cricket', 'data', 'fast', 'favorite', 'for', 'is', 'love', 'mining', 'my', 'on', 'playing', 'python', 'regularly', 'scientist', 'sport', 'very', 'work']
    
    feature_names = vectorizer.get_feature_names()
    df = pd.DataFrame(X_train.toarray(), columns=feature_names)
    

    Now see what happens if you do the same on the test dataset:

    vectorizer_test = TfidfVectorizer()
    X_test = vectorizer_test.fit_transform(test['Text'])
    print(vectorizer_test.get_feature_names())
    
    #['am', 'codes', 'cricket', 'data', 'field', 'like', 'new', 'on', 'play', 'python', 'science', 'to', 'weekends', 'writing']
    feature_names_test = vectorizer_test.get_feature_names()
    df_test= pd.DataFrame(X_test.toarray(),columns = feature_names_test)
    

    It has created another vocabulary from the test dataset, which has 14 unique words (columns) compared to 18 words (columns) from the train data.

    Now if you train a machine learning algorithm on your train data for text classification and try to make predictions on the matrix from the test data, it will fail with an error saying that the features differ between the train and test data.
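
    You can check the mismatch directly from the matrix shapes (a minimal sanity check, reusing the variables defined above):

    print(X_train.shape)  # (5, 18) - 5 training documents, 18 features from the train vocabulary
    print(X_test.shape)   # (3, 14) - 3 test documents, 14 features from the test vocabulary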

    To overcome this error, we do something like this in text classification:

    X_test_from_train = vectorizer.transform(test['Text'])
    feature_names_test_from_train = vectorizer.get_feature_names()
    df_test_from_train = pd.DataFrame(X_test_from_train.toarray(),columns = feature_names_test_from_train)
    

    Here you will have noticed that we didn't use fit_transform; instead we used transform on the test data. The reason is that, when making predictions on the test data, we only want to use the features learned from the train data, so that we don't get a feature-mismatch error.
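
    As a quick check (again a sketch reusing the variables above), the transformed test matrix now has the same number of columns as the train matrix; any test word that is not in the train vocabulary is simply ignored:

    print(X_test_from_train.shape)  # (3, 18) - same 18 columns as X_train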

    Hope this helps!!
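
    The same idea applies to the fuzzy-matching script in the question (a sketch, reusing the ngrams analyzer and the name arrays defined there): fitting on the clean names fixes the character n-gram vocabulary, and transforming the dirty names projects them into that same feature space, which is what makes the dot products computed by awesome_cossim_top comparable as cosine similarities.

    vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams)
    clean_idf_matrix = vectorizer.fit_transform(Name_clean)  # learn n-grams from the clean names
    dirty_idf_matrix = vectorizer.transform(Name_dirty)      # reuse that vocabulary for the dirty names
    print(clean_idf_matrix.shape[1] == dirty_idf_matrix.shape[1])  # True - same feature space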