
get_feature_names not found in CountVectorizer()


I'm mining the Stack Overflow data dump of posts about deep learning libraries. I'd like to identify stop words in my corpus (like 'python' for instance). I want to get my feature names so I can identify the words with highest term frequencies.

I create my documents and my corpus as follows:

import csv
from sklearn.feature_extraction.text import CountVectorizer

with open("StackOverflow_2018_Data.csv") as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    line_count = 0
    pytorch_doc = ''
    tensorflow_doc = ''
    cotag_list = []
    keras_doc = ''
    counte = 0
    for row in csv_reader:
        if row[2] == 'tensorflow':
            tensorflow_doc += row[3] + ' '
        if row[2] == 'keras':
            keras_doc += row[3] + ' '
        if row[2] == 'pytorch':
            pytorch_doc += row[3] + ' '

corpus = [pytorch_doc, tensorflow_doc, keras_doc]
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(corpus)
print(x)
x.toarray()
Dict = []
feat = x.get_feature_names()
for i,arr in enumerate(x):
    for x, ele in enumerate(arr):
        if i == 0:
            Dict += ('pytorch', feat[x], ele)
        if i == 1:
            Dict += ('tensorflow', feat[x], ele)
        if i == 2:
            Dict += ('keras', feat[x], ele)

sorted_arr = sorted(Dict, key=lambda tup: tup[2])

However, I am getting:

  File "sklearn_stopwords.py", line 83, in <module>
    main()
  File "sklearn_stopwords.py", line 50, in main
    feat = x.get_feature_names()
  File "/opt/anaconda3/lib/python3.7/site-packages/scipy/sparse/base.py", line 686, in __getattr__
    raise AttributeError(attr + " not found")
AttributeError: get_feature_names not found

Solution

  • get_feature_names is a method of the CountVectorizer object. You are calling get_feature_names on the result of fit_transform, which is a scipy.sparse matrix and has no such method — hence the AttributeError.

    You need to call vectorizer.get_feature_names() instead. (Note that in scikit-learn 1.0 this method was renamed to get_feature_names_out(), and the old name was removed in 1.2.)

    Try this MVCE:

    from sklearn.feature_extraction.text import CountVectorizer
    vectorizer = CountVectorizer()
    corpus = ['This is the first document.',
              'This is the second second document.',
              'And the third one.',
              'Is this the first document?']
    
    X = vectorizer.fit_transform(corpus)
    
    features = vectorizer.get_feature_names()
    
    features
    

    Output:

    ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
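    Applied to your original loop, the same fix looks roughly like the sketch below (a toy three-document corpus stands in for your real pytorch/tensorflow/keras documents). It also appends (tag, word, count) tuples to the list — your `Dict += (...)` flattens the tuple into three separate elements, which would break the later `sorted(..., key=lambda tup: tup[2])`:

    ```python
    from sklearn.feature_extraction.text import CountVectorizer

    # Toy stand-ins for pytorch_doc, tensorflow_doc, keras_doc
    corpus = ['pytorch tensors autograd',
              'tensorflow graph session',
              'keras layers model python']
    tags = ['pytorch', 'tensorflow', 'keras']

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(corpus)

    # Call on the vectorizer, not on the sparse matrix X.
    # get_feature_names() became get_feature_names_out() in scikit-learn 1.0
    # and was removed in 1.2, so fall back for older versions.
    try:
        feat = vectorizer.get_feature_names_out()
    except AttributeError:
        feat = vectorizer.get_feature_names()

    counts = X.toarray()  # dense (n_docs, n_features) count matrix
    triples = [(tags[i], feat[j], counts[i, j])
               for i in range(len(tags))
               for j in range(len(feat))
               if counts[i, j] > 0]

    # Highest term frequencies first
    triples.sort(key=lambda tup: tup[2], reverse=True)
    ```

    From here you can inspect the top of `triples` to spot corpus-wide stop-word candidates such as 'python'.
    
    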