How can I use the TF-IDF vectorizer from the scikit-learn library to extract unigrams and bigrams of tweets? I want to train a classifier with the output.
This is the code from scikit-learn:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
TfidfVectorizer has an ngram_range parameter that determines the range of n-grams included as features in the final matrix. In your case, you want (1, 2) to go from unigrams to bigrams:
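import pandas as pd

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(corpus).toarray()
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
pd.DataFrame(X, columns=vectorizer.get_feature_names_out())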
        and  and this  document  document is     first  first document  \
0  0.000000  0.000000  0.314532     0.000000  0.388510        0.388510
1  0.000000  0.000000  0.455513     0.356824  0.000000        0.000000
2  0.357007  0.357007  0.000000     0.000000  0.000000        0.000000
3  0.000000  0.000000  0.282940     0.000000  0.349487        0.349487

         is    is the   is this       one  ...       the  the first  \
0  0.257151  0.314532  0.000000  0.000000  ...  0.257151   0.388510
1  0.186206  0.227756  0.000000  0.000000  ...  0.186206   0.000000
2  0.186301  0.227873  0.000000  0.357007  ...  0.186301   0.000000
3  0.231322  0.000000  0.443279  0.000000  ...  0.231322   0.349487
...