optimization, scikit-learn, feature-detection, vocabulary, sklearn-pandas

Adding a self-built vocabulary in scikit-learn?


In sklearn.feature_extraction.text.TfidfVectorizer, we can inject our own vocabulary using the vocabulary parameter. But in that case, only the words I selected myself are used as features.

I want to use the automatically detected features together with my custom vocabulary.

One way to solve this problem is to fit the model once and get the detected features using

vocab=vectorizer.get_feature_names()

then append my own list to vocab

vocab + vocabulary

and build the model again with the combined list.

Is there a way to perform this whole process in a single step?


Solution

  • I don't think there is a simpler way to achieve what you want. One thing you can do is reuse the code that CountVectorizer uses to build its vocabulary. I went through the source code, and the relevant method is

    _count_vocab(self, raw_documents, fixed_vocab)
    

    called with fixed_vocab=False.

    So I suggest adapting the following code (Source) to build the vocabulary yourself before you run TfidfVectorizer.

    # Excerpt from CountVectorizer. It relies on the module's own imports
    # (defaultdict, numpy as np, scipy.sparse as sp) and on helpers that are
    # internal to scikit-learn (_make_int_array, frombuffer_empty).
    def _count_vocab(self, raw_documents, fixed_vocab):
        """Create sparse feature matrix, and vocabulary where fixed_vocab=False
        """
        if fixed_vocab:
            vocabulary = self.vocabulary_
        else:
            # Add a new value when a new vocabulary item is seen
            vocabulary = defaultdict()
            vocabulary.default_factory = vocabulary.__len__

        analyze = self.build_analyzer()
        j_indices = _make_int_array()
        indptr = _make_int_array()
        indptr.append(0)
        for doc in raw_documents:
            for feature in analyze(doc):
                try:
                    j_indices.append(vocabulary[feature])
                except KeyError:
                    # Ignore out-of-vocabulary items for fixed_vocab=True
                    continue
            indptr.append(len(j_indices))

        if not fixed_vocab:
            # disable defaultdict behaviour
            vocabulary = dict(vocabulary)
            if not vocabulary:
                raise ValueError("empty vocabulary; perhaps the documents only"
                                 " contain stop words")

        j_indices = frombuffer_empty(j_indices, dtype=np.intc)
        indptr = np.frombuffer(indptr, dtype=np.intc)
        values = np.ones(len(j_indices))

        X = sp.csr_matrix((values, j_indices, indptr),
                          shape=(len(indptr) - 1, len(vocabulary)),
                          dtype=self.dtype)
        X.sum_duplicates()
        return vocabulary, X
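One possible adaptation of the vocabulary-building part is sketched below, keeping the same `defaultdict` trick but seeding it with the custom terms first. The function name `build_vocabulary`, its parameters, and the fallback tokenizer are assumptions made for this example; they are not part of scikit-learn.

```python
import re
from collections import defaultdict


def build_vocabulary(raw_documents, custom_terms, analyze=None):
    """Return {term: column index} for the custom terms plus detected terms."""
    if analyze is None:
        # Stand-in analyzer (an assumption): lowercase the text and keep
        # tokens of two or more word characters.
        token_re = re.compile(r"\b\w\w+\b")

        def analyze(doc):
            return token_re.findall(doc.lower())

    # Same trick as in _count_vocab: looking up a missing key assigns it
    # the next free index, because len() is evaluated before insertion.
    vocabulary = defaultdict()
    vocabulary.default_factory = vocabulary.__len__

    for term in custom_terms:        # seed the custom terms first
        vocabulary[term]
    for doc in raw_documents:        # then add whatever the analyzer finds
        for feature in analyze(doc):
            vocabulary[feature]

    return dict(vocabulary)          # disable defaultdict behaviour


vocab = build_vocabulary(["The cat sat", "a dog"], ["bird", "cat"])
print(vocab)  # {'bird': 0, 'cat': 1, 'the': 2, 'sat': 3, 'dog': 4}
```

The resulting dict should be usable as `TfidfVectorizer(vocabulary=vocab)`, and passing `TfidfVectorizer().build_analyzer()` as `analyze` keeps the tokenization consistent with the vectorizer. Note that this sketch only mirrors the counting loop, so any pruning the vectorizer would normally apply afterwards (for example via `min_df`, `max_df`, or `max_features`) is not reproduced here.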