I am writing a function that does custom word removal, stemming (getting the root form of the word) and then tf-idf.
My input data to the function is a list. Custom word removal works when I try it on an individual list, but when I combine everything in the function, I get an attribute error:
AttributeError: 'list' object has no attribute 'lower'
Here is my code:
def tfidf_kw(K):
    # Select docs in cluster K
    docs = np.array(mydata2)[km_r3.labels_==K]
    ps = PorterStemmer()
    stem_docs = []
    for doc in docs:
        keep_tokens = []
        for token in doc.split(' '):
            # custom stopword removal
            my_list = ['model', 'models', 'modeling', 'modelling', 'python',
                       'train', 'training', 'trains', 'trained', 'test', 'testing', 'tests', 'tested']
            token = [sub_token for sub_token in list(doc) if sub_token not in my_list]
            stem_token = ps.stem(token)
            keep_tokens.append(stem_token)
        keep_tokens = ' '.join(keep_tokens)
        stem_docs.append(keep_tokens)
    return(keep_tokens)
The code that follows is for tf-idf, and that part works. This is the line where I need help understanding what I am doing wrong:
token = [sub_token for sub_token in list(doc) if sub_token not in my_list]
Here is the complete error:
AttributeError Traceback (most recent call last)
<ipython-input-154-528a540678b0> in <module>
49 #return(sorted_df)
50
---> 51 tfidf_kw(0)
<ipython-input-154-528a540678b0> in tfidf_kw(K)
20
21
---> 22 stem_token=ps.stem(token)
23 keep_tokens.append(stem_token)
24
~/opt/anaconda3/lib/python3.8/site-packages/nltk/stem/porter.py in stem(self, word)
650
651 def stem(self, word):
--> 652 stem = word.lower()
653
654 if self.mode == self.NLTK_EXTENSIONS and word in self.pool:
AttributeError: 'list' object has no attribute 'lower'
Line 51, where it says tfidf_kw(0), is where I am calling the function to check it for K=0.
Apparently the ps.stem method expects a single word (a string) as its argument, but you are passing it a list of strings. Since you are already inside a for token in doc.split(' ') loop, it does not seem to make sense to additionally use the list comprehension [... for sub_token in list(doc) ...]: it overwrites token with a list, and list(doc) iterates over the characters of the document rather than its words.
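A minimal sketch of the difference, using only NLTK's PorterStemmer from your code:

from nltk.stem import PorterStemmer

ps = PorterStemmer()
print(ps.stem('models'))         # fine: stem() takes one string and returns 'model'
# ps.stem(['models', 'tests'])   # AttributeError: 'list' object has no attribute 'lower'

# This also shows why the comprehension over list(doc) sees single characters:
print(list('train model'))       # ['t', 'r', 'a', 'i', 'n', ' ', 'm', 'o', 'd', 'e', 'l']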
If your goal is to skip those tokens that are in my_list, presumably you want to write the for token in doc.split(' ') loop like this:
for token in doc.split(' '):
    my_list = ['model', 'models', 'modeling', 'modelling', 'python',
               'train', 'training', 'trains', 'trained', 'test', 'testing', 'tests', 'tested']
    if token in my_list:
        continue
    stem_token = ps.stem(token)
    keep_tokens.append(stem_token)
Here, if token is one of the words in my_list, the continue statement skips the rest of the current iteration and the loop continues with the next token.
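To put it in context, here is a sketch of how that loop could sit inside tfidf_kw (a sketch only: mydata2 and km_r3 are assumed to exist as in the question, the tf-idf part that follows is omitted, and hoisting my_list out of the loop plus returning stem_docs instead of keep_tokens are small optional tweaks):

import numpy as np
from nltk.stem import PorterStemmer

# Custom stopwords; defined once as a set instead of on every loop iteration.
my_list = {'model', 'models', 'modeling', 'modelling', 'python',
           'train', 'training', 'trains', 'trained', 'test', 'testing', 'tests', 'tested'}

def tfidf_kw(K):
    # Select docs in cluster K (mydata2 and km_r3 come from the question)
    docs = np.array(mydata2)[km_r3.labels_ == K]
    ps = PorterStemmer()
    stem_docs = []
    for doc in docs:
        keep_tokens = []
        for token in doc.split(' '):
            if token in my_list:          # skip custom stopwords
                continue
            keep_tokens.append(ps.stem(token))
        stem_docs.append(' '.join(keep_tokens))
    return stem_docs                      # all stemmed docs, not just the last one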