python tensorflow vector tf-idf tfidfvectorizer

TF-IDF Vectors Example (HELP)


Hey, I tried 3 different approaches, but I can't decide which is the right way to use TF-IDF:

The first code calls fit_transform on x_train and x_test separately, giving shapes (5000, 94462) and (5000, 93007).

The second code fits on train and test combined, which I think is not right because IDF should be calculated from the training documents only. It gives (5000, 152800) and (5000, 152800).

The third code gives (5000, 94462) and (5000, 94462).

To me the third code is right, because I fit on the train data only and transform the test data based on that.

# Approach 1: fit_transform on train and test separately
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
xtrain_tfidf = vectorizer.fit_transform(x_train)
xtest_tfidf = vectorizer.fit_transform(x_test)
print(xtrain_tfidf.shape)
print(xtest_tfidf.shape)

# Approach 2: fit on train and test combined, then transform each
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
vectorizer.fit(x_train+x_test)
xtrain_tfidf = vectorizer.transform(x_train)
xtest_tfidf = vectorizer.transform(x_test)
print(xtrain_tfidf.shape)
print(xtest_tfidf.shape)

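# Approach 3: fit on train only, then transform both with the same vocabulary
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer()
vect.fit(x_train)
x_train_vectorized = vect.transform(x_train)
x_test_vectorized = vect.transform(x_test)
print(x_train_vectorized.shape)
print(x_test_vectorized.shape)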


Solution

  • The right way is to call fit_transform (fit and transform in one step) on your training data, and only transform on your test data.

    
    from sklearn.feature_extraction.text import TfidfVectorizer
    
    vectorizer = TfidfVectorizer()
    xtrain_tfidf = vectorizer.fit_transform(x_train)
    xtest_tfidf = vectorizer.transform(x_test)
    print(xtrain_tfidf.shape)
    print(xtest_tfidf.shape)
    
    

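    You never fit_transform test data: doing so relearns the vocabulary and IDF weights from the test set, so the columns no longer match those of the training matrix (as in your first approach, 94462 vs. 93007 features).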
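To see the difference on something small, here is a minimal sketch using two made-up toy corpora (the sentences below are hypothetical stand-ins for x_train and x_test, not the asker's data). It compares the column counts you get from transforming the test set with the fitted vectorizer versus refitting on the test set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpora standing in for the real x_train / x_test.
x_train = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]
x_test = [
    "the bird sat on the fence",
    "a cat and a bird",
]

# Recommended: learn the vocabulary and IDF weights from the training data only,
# then reuse them unchanged for the test data.
vectorizer = TfidfVectorizer()
xtrain_tfidf = vectorizer.fit_transform(x_train)
xtest_tfidf = vectorizer.transform(x_test)

# For comparison only: refitting on the test data builds a different vocabulary.
refit_vectorizer = TfidfVectorizer()
xtest_refit = refit_vectorizer.fit_transform(x_test)

print("train:", xtrain_tfidf.shape)
print("test (transform only):", xtest_tfidf.shape)
print("test (refit, mismatched):", xtest_refit.shape)

# Words seen only in the test set are simply ignored by transform().
print("'bird' in training vocabulary:", "bird" in vectorizer.vocabulary_)
```

With transform only, both matrices have the same number of columns, so a model trained on xtrain_tfidf can score xtest_tfidf directly. The refit matrix has its own columns and cannot be fed to that model.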