Hey, I tried 3 different approaches but I can't decide which is the right way to use TF-IDF:
The first code calls fit_transform on x_train and x_test separately, giving (5000, 94462) (5000, 93007).
The second code fits on train and test together, which I think is not right because the IDF should be calculated from the training documents only, giving (5000, 152800) (5000, 152800).
The third code fits on x_train only, giving (5000, 94462) (5000, 94462).
# Approach 1: fit_transform on train and test separately
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
xtrain_tfidf = vectorizer.fit_transform(x_train)
xtest_tfidf = vectorizer.fit_transform(x_test)
print(xtrain_tfidf.shape)
print(xtest_tfidf.shape)
# Approach 2: fit on train and test combined, then transform each
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
vectorizer.fit(x_train+x_test)
xtrain_tfidf = vectorizer.transform(x_train)
xtest_tfidf = vectorizer.transform(x_test)
print(xtrain_tfidf.shape)
print(xtest_tfidf.shape)
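# Approach 3: fit on the training data only, then transform both sets
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer()
vect.fit(x_train)
x_train_vectorized = vect.transform(x_train)
x_test_vectorized = vect.transform(x_test)
print(x_train_vectorized.shape)
print(x_test_vectorized.shape)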
The right way is to fit and transform (that is, fit_transform) your training data and only transform your test data.
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
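# Learn the vocabulary and IDF weights from the training data only
xtrain_tfidf = vectorizer.fit_transform(x_train)
# Reuse that vocabulary and those IDF weights for the test data
xtest_tfidf = vectorizer.transform(x_test)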
print(xtrain_tfidf.shape)
print(xtest_tfidf.shape)
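To see on a small scale why the shapes in your first approach differ, here is a rough sketch. The two tiny corpora are made up for illustration and stand in for your x_train and x_test; the variable names are mine, not from your code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpora standing in for x_train and x_test.
train_docs = ["the cat sat", "the dog barked"]
test_docs = ["the cat meowed loudly", "a bird sang"]

# Fit on the training documents only, then reuse the fitted vectorizer.
vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(train_docs)
test_matrix = vectorizer.transform(test_docs)

# Refitting on the test documents builds a separate, unrelated vocabulary.
refit_vectorizer = TfidfVectorizer()
refit_matrix = refit_vectorizer.fit_transform(test_docs)

print(train_matrix.shape, test_matrix.shape)  # same number of columns
print(refit_matrix.shape)                     # different number of columns
print(sorted(vectorizer.vocabulary_))
print(sorted(refit_vectorizer.vocabulary_))
```

Words that appear only in the test documents are simply ignored by transform, which is why the column count stays equal to the training vocabulary size.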
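Never call fit_transform on the test data: that refits the vectorizer, so the test matrix gets its own vocabulary and IDF weights and no longer matches the features of the training matrix.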
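If you want to make this mistake harder to make, one option is to wrap the vectorizer and the model in a scikit-learn Pipeline, which calls fit_transform on the vectorizer during fit and only transform during predict. A minimal sketch follows; the toy texts, the labels, and the choice of LogisticRegression are my assumptions, since the question does not say which classifier is used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for x_train, y_train and x_test.
x_train = ["good movie", "bad movie", "great film", "terrible film"]
y_train = [1, 0, 1, 0]
x_test = ["good film", "awful movie"]

# The classifier is an assumption; any scikit-learn estimator would fit here.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())

# fit() fits the vectorizer on the training texts only.
model.fit(x_train, y_train)

# predict() only transforms the test texts with the already fitted vectorizer.
predictions = model.predict(x_test)
print(predictions.shape)
```

This also keeps things correct under cross-validation, because the vectorizer is refitted on the training part of each fold.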