I'm doing text classification using Python and scikit-learn.
Currently, I use TfidfVectorizer as the vectorizer (to transform raw text into feature vectors) and MultinomialNB as the classifier. I use the parameter ngram_range=(1,2) (see http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html ), i.e. I use both unigrams and bigrams.
After classifying and testing my algorithm on the test set and CV set, I'd like to improve accuracy. I looked at the most informative features (following the question How to get most informative features for scikit-learn classifiers?), and I see that among the most informative features there are single words (ngram=1) that have no impact on classification by themselves, but as part of bigrams (word collocations) they have great impact.
So I can't use stop_words, because TfidfVectorizer would then also drop these words from the collocations, and I can't use a preprocessor for the same reason. Question: how can I exclude certain words as standalone features in TfidfVectorizer, but keep these words inside collocations?
I think there are a few possible ways of doing it:
1. Construct the TfidfVectorizer twice, both times with ngram_range=(1,2). Extract the feature names after fitting the first vectorizer, filter out the unwanted unigram features, and feed the remaining features as the vocabulary argument of the second vectorizer. Use the second vectorizer for transformation.
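A minimal sketch of this first approach (the documents and the set of unwanted unigrams are just placeholders; get_feature_names_out() is the name of get_feature_names() in recent scikit-learn versions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["this movie is great", "this movie is terrible"]  # toy corpus
unwanted = {"this", "movie"}                              # unigrams to drop

# First pass: discover the full (1,2)-gram feature set.
v1 = TfidfVectorizer(ngram_range=(1, 2))
v1.fit(docs)

# Keep every bigram (they contain a space) but drop the unwanted unigrams.
vocab = [f for f in v1.get_feature_names_out()
         if " " in f or f not in unwanted]

# Second pass: restrict the vectorizer to the filtered vocabulary.
v2 = TfidfVectorizer(ngram_range=(1, 2), vocabulary=vocab)
X = v2.fit_transform(docs)
```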
2. Supply the analyzer argument of TfidfVectorizer as a function which performs customized extraction of features from each raw document, e.g. one that avoids emitting the useless unigrams as features (but this means you need to do the work of generating the word combinations yourself).
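A sketch of the analyzer route, assuming simple whitespace tokenization (replace it with whatever tokenization you actually use):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

unwanted = {"this", "movie"}  # words to suppress as unigrams only

def analyzer(doc):
    tokens = doc.lower().split()  # naive tokenization, for illustration
    # Emit unigrams, skipping the unwanted words...
    features = [t for t in tokens if t not in unwanted]
    # ...but build the bigrams from ALL tokens, so the unwanted words
    # still appear inside collocations.
    features += [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return features

# With a callable analyzer, ngram_range is ignored; the n-grams are
# generated by the function above.
vectorizer = TfidfVectorizer(analyzer=analyzer)
X = vectorizer.fit_transform(["this movie is great", "this movie is terrible"])
```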
3. Fit a TfidfVectorizer as usual, which might produce some unwanted unigrams. Use get_feature_names() to find the column indices corresponding to the features you want to keep. When you transform() with the vectorizer, do an extra step of slicing the columns of the resulting sparse matrix, based on the indices of interest.
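And a sketch of the slicing approach (again with placeholder data; get_feature_names_out() is the current name of get_feature_names()):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["this movie is great", "this movie is terrible"]  # toy corpus
unwanted = {"this", "movie"}

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)

# Column indices of the features to keep: all bigrams, plus the
# unigrams that are not in the unwanted set.
names = vectorizer.get_feature_names_out()
keep = np.flatnonzero([" " in f or f not in unwanted for f in names])

# Slice the sparse matrix down to those columns.
X_reduced = X[:, keep]
```

Remember to apply the same column slice to the output of transform() on new documents at prediction time, so that training and test matrices stay aligned.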