Tags: machine-learning, tf-idf

TF-IDF vectors can be generated at different levels of input tokens (words, characters, n-grams). Which should we use?


a. Word-level TF-IDF: a matrix representing the TF-IDF score of every term across the documents.

b. N-gram-level TF-IDF: n-grams are combinations of N consecutive terms; this matrix represents the TF-IDF scores of n-grams.

c. Character-level TF-IDF: a matrix representing the TF-IDF scores of character-level n-grams in the corpus.

from sklearn.feature_extraction.text import TfidfVectorizer

# word level tf-idf
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
tfidf_vect.fit(trainDF['texts'])
xtrain_tfidf = tfidf_vect.transform(train_x)
xvalid_tfidf = tfidf_vect.transform(valid_x)


# n-gram level tf-idf (combinations of N consecutive terms)
tfidf_vect_ngram = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}',
                                   ngram_range=(2, 3), max_features=5000)
tfidf_vect_ngram.fit(trainDF['texts'])
xtrain_tfidf_ngram = tfidf_vect_ngram.transform(train_x)
xvalid_tfidf_ngram = tfidf_vect_ngram.transform(valid_x)


# character level tf-idf (token_pattern is ignored when analyzer='char', so it is omitted)
tfidf_vect_ngram_chars = TfidfVectorizer(analyzer='char', ngram_range=(2, 3), max_features=5000)
tfidf_vect_ngram_chars.fit(trainDF['texts'])
xtrain_tfidf_ngram_chars = tfidf_vect_ngram_chars.transform(train_x)
xvalid_tfidf_ngram_chars = tfidf_vect_ngram_chars.transform(valid_x)

Solution

  • There is no single right answer for all cases; the best level depends on the nature of your data.

    Use GridSearchCV to identify the best-performing option for your particular case. The official scikit-learn documentation has a good example of a pipeline for text feature extraction.
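A minimal sketch of that idea: put the vectorizer and a classifier in a Pipeline and let GridSearchCV search over the tokenization level. The corpus, labels, and choice of LogisticRegression below are illustrative assumptions; substitute your own `trainDF['texts']` and labels.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# toy labelled corpus (assumption; replace with your own data)
texts = ["good movie", "great film", "bad movie", "terrible film",
         "nice plot", "awful acting", "loved it", "hated it"]
labels = [1, 1, 0, 0, 1, 0, 1, 0]

pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression(max_iter=1000)),
])

# search over the tokenization level instead of guessing it
param_grid = {
    'tfidf__analyzer': ['word', 'char'],
    'tfidf__ngram_range': [(1, 1), (1, 2), (2, 3)],
}
search = GridSearchCV(pipe, param_grid, cv=2)
search.fit(texts, labels)
print(search.best_params_)  # the winning analyzer / ngram_range combination
```

The `tfidf__` prefix routes each parameter to the vectorizer step, so the same grid search also tunes the downstream classifier if you add `clf__` parameters.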