
Is it possible to create an equivalent "restrict" method for CountVectorizer as is available for DictVectorizer in Scikit-learn?


For DictVectorizer it is possible to subset the object by using the restrict() method. Here is an example where I have explicitly listed the features to retain by using a boolean array.

from sklearn.feature_extraction import DictVectorizer
import numpy as np
v = DictVectorizer()
D = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]
X = v.fit_transform(D)
v.get_feature_names()
>>['bar', 'baz', 'foo']
user_list = np.array([False, False, True], dtype=bool)
v.restrict(user_list)
v.get_feature_names()
>>['foo']
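
For completeness, the restriction also propagates to subsequent transform() calls. A quick end-to-end check (this sketch uses the feature_names_ attribute instead of get_feature_names(), which was deprecated in later scikit-learn versions):

```python
from sklearn.feature_extraction import DictVectorizer
import numpy as np

v = DictVectorizer()
D = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]
v.fit_transform(D)

# Keep only the third feature ('foo'); restrict() modifies v in place.
v.restrict(np.array([False, False, True], dtype=bool))
print(v.feature_names_)          # ['foo']
print(v.transform(D).toarray())  # [[1.], [3.]] -- only the 'foo' column remains
```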

I would like the same ability for a non-normalized CountVectorizer object. I have not found a way to slice the object returned by CountVectorizer, since it has many dependent attributes. My motivation is to avoid repeatedly fitting and transforming the text data when all I want to do is remove features after the initial fit and transform. Is there an equivalent method that I am missing, or can a custom method be easily created for CountVectorizer?

UPDATE based on @Vivek's response

This method appears to work. Here is the code, run directly in a Python session.

from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import copy as cp  # needed for cp.deepcopy below
v = CountVectorizer()
D = ['Data science is about the data', 'The science is amazing', 'Predictive modeling is part of data science']
v.fit_transform(D)
print(v.get_feature_names())
print(len(v.get_feature_names()))
>> ['about', 'amazing', 'data', 'is', 'modeling', 'of', 'part', 'predictive', 'science', 'the']
>> 10
user_list = np.array([False, False, True, False, False, True, False, False, True, False], dtype=bool)

feature_names = v.get_feature_names()
new_vocab = {}
for i in np.where(user_list)[0]:
    print(feature_names[i])
    new_vocab[feature_names[i]] = len(new_vocab)
new_vocab

>> data
>> of
>> science
>> {'data': 0, 'of': 1, 'science': 2}

v_copy = cp.deepcopy(v)
v_copy.vocabulary_ = new_vocab
print(v_copy.vocabulary_)
print(v_copy.get_feature_names())
v_copy

>> {'data': 0, 'of': 1, 'science': 2}
>> ['data', 'of', 'science']
>> CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

v_copy.transform(D).toarray()
>> array([[2, 0, 1],
       [0, 0, 1],
       [1, 1, 1]], dtype=int64)

Thanks @Vivek! This appears to behave as expected for a non-normalized CountVectorizer object.
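
The steps above can be packaged into a small reusable helper. This is a hypothetical function (restrict_count_vectorizer is not part of scikit-learn); it reads the feature names from vocabulary_ rather than the since-deprecated get_feature_names(), but otherwise follows the session above:

```python
import copy
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def restrict_count_vectorizer(vectorizer, support):
    """Return a copy of a fitted CountVectorizer keeping only the features
    where the boolean mask `support` is True.
    Hypothetical helper -- not part of scikit-learn."""
    # Feature names ordered by column index, matching get_feature_names().
    names = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
    kept = [name for name, keep in zip(names, support) if keep]
    restricted = copy.deepcopy(vectorizer)
    restricted.vocabulary_ = {name: i for i, name in enumerate(kept)}
    return restricted

v = CountVectorizer()
D = ['Data science is about the data',
     'The science is amazing',
     'Predictive modeling is part of data science']
v.fit_transform(D)
mask = np.array([False, False, True, False, False,
                 True, False, False, True, False], dtype=bool)
v2 = restrict_count_vectorizer(v, mask)
print(sorted(v2.vocabulary_, key=v2.vocabulary_.get))  # ['data', 'of', 'science']
print(v2.transform(D).toarray())
```

As in the session above, the restricted copy reuses the fitted analyzer, so transform() counts only the retained vocabulary without re-fitting.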


Solution

  • Answer implementing @Vivek's recommendation, posted as a comment on the original question. The code is the same session shown in the update above.