In sklearn.feature_extraction.text.TfidfVectorizer, we can inject our own vocabulary using the vocabulary parameter of the model. But in this case, only my own selected words are used for the model.
I want to use automatically detected features with my custom vocabulary.
One way to solve this problem is to fit the model, get the features it detected (vocab), append my own list to them (vocab + vocabulary), and then build the model again with the combined list.
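For reference, a minimal sketch of this two-pass workaround. The corpus `docs` and the word list `custom_words` are hypothetical placeholders, and the detected terms are read from the fitted `vocabulary_` attribute so the sketch does not depend on a particular feature-name accessor.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus and custom word list, for illustration only.
docs = ["the cat sat on the mat", "the dog chased the cat"]
custom_words = ["bird", "cat"]

# Pass 1: let the vectorizer detect features on its own.
detector = TfidfVectorizer()
detector.fit(docs)
detected = list(detector.vocabulary_)  # keys of the learned term -> index mapping

# Merge detected and custom terms, dropping duplicates.
# Sorting keeps the column order deterministic.
merged_vocab = sorted(set(detected) | set(custom_words))

# Pass 2: rebuild the model with the merged vocabulary.
model = TfidfVectorizer(vocabulary=merged_vocab)
X = model.fit_transform(docs)
```

Custom words only match if they look like what the analyzer produces (lowercased single tokens with the default settings). A custom word that never occurs in the corpus, such as "bird" here, still gets a column, but it is all zeros.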
Is there a way to perform this whole process in a single step?
I don't think there is a simpler way to achieve what you want. One thing you can do is reuse the code that CountVectorizer uses to create the vocabulary. I went through the source code, and the method is
_count_vocab(self, raw_documents, fixed_vocab)
called with fixed_vocab=False.
So I suggest you adapt the following code (from the scikit-learn source) to create the vocabulary before you run the TfidfVectorizer:
    # Excerpt of a CountVectorizer method. It relies on names available in
    # that module: defaultdict, np (numpy), sp (scipy.sparse), and the
    # helpers _make_int_array and frombuffer_empty.
    def _count_vocab(self, raw_documents, fixed_vocab):
        """Create sparse feature matrix, and vocabulary where fixed_vocab=False
        """
        if fixed_vocab:
            vocabulary = self.vocabulary_
        else:
            # Add a new value when a new vocabulary item is seen
            vocabulary = defaultdict()
            vocabulary.default_factory = vocabulary.__len__

        analyze = self.build_analyzer()
        j_indices = _make_int_array()
        indptr = _make_int_array()
        indptr.append(0)
        for doc in raw_documents:
            for feature in analyze(doc):
                try:
                    j_indices.append(vocabulary[feature])
                except KeyError:
                    # Ignore out-of-vocabulary items for fixed_vocab=True
                    continue
            indptr.append(len(j_indices))

        if not fixed_vocab:
            # disable defaultdict behaviour
            vocabulary = dict(vocabulary)
            if not vocabulary:
                raise ValueError("empty vocabulary; perhaps the documents only"
                                 " contain stop words")

        j_indices = frombuffer_empty(j_indices, dtype=np.intc)
        indptr = np.frombuffer(indptr, dtype=np.intc)
        values = np.ones(len(j_indices))

        X = sp.csr_matrix((values, j_indices, indptr),
                          shape=(len(indptr) - 1, len(vocabulary)),
                          dtype=self.dtype)
        X.sum_duplicates()
        return vocabulary, X
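One possible adaptation, as a rough sketch rather than a definitive implementation: instead of copying the private method, use the public build_analyzer() to collect every term the vectorizer would see, merge in the custom words, and fit TfidfVectorizer once with the result. The helper name `build_merged_vocabulary`, the corpus `docs`, and the custom word "bird" are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def build_merged_vocabulary(raw_documents, custom_words, **vectorizer_params):
    """Collect every term the analyzer emits and merge it with custom_words.

    Hypothetical helper: it mirrors the vocabulary-building loop of
    _count_vocab (with fixed_vocab=False) but skips the count matrix.
    """
    # build_analyzer() works on an unfitted vectorizer.
    analyze = TfidfVectorizer(**vectorizer_params).build_analyzer()
    terms = set(custom_words)
    for doc in raw_documents:
        terms.update(analyze(doc))
    # Sorting gives a deterministic column order.
    return sorted(terms)


# Hypothetical toy corpus, for illustration only.
docs = ["the cat sat on the mat", "the dog chased the cat"]

vocab = build_merged_vocabulary(docs, ["bird"])
model = TfidfVectorizer(vocabulary=vocab)
X = model.fit_transform(docs)
```

This still reads the documents twice (once to collect terms, once to fit), but the model itself is built only once. Like _count_vocab, the helper does no pruning, so settings such as min_df, max_df, or max_features would have to be handled separately. Pass the same analyzer-related parameters (for example stop_words or ngram_range) to both the helper and the final TfidfVectorizer so the vocabulary matches what the model tokenizes.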