I have the following dataframe
   0  1     2       3      4     5       6
0  i  love  eating  spicy  hand  pulled  noodles
1  i  also  like    to     game  alot
I'd like to apply a function to create a new dataframe where, instead of the words above, each cell holds that word's part-of-speech tag.
I'm using nltk.pos_tag, and I tried df.apply(nltk.pos_tag).
My expected output should look like this:
   0   1   2   3   4   5   6
0  NN  NN  VB  JJ  NN  VB  NN
1  NN  DT  NN  NN  VB  DT
However, I get IndexError: ('string index out of range', 'occurred at index 6')
Also, I understand that nltk.pos_tag returns its output as tuples in the format ('word', 'pos_tag'), so some further manipulation may be required to keep only the tag. Any suggestions on how to go about doing this efficiently?
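For reference, a quick check of that return format on a single word (assuming the averaged perceptron tagger data has been downloaded):
import nltk
# nltk.download('averaged_perceptron_tagger')  # required once
print(nltk.pos_tag(['noodles']))        # e.g. [('noodles', 'NNS')]
print(nltk.pos_tag(['noodles'])[0][1])  # just the tag, e.g. 'NNS'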
Traceback:
Traceback (most recent call last):
File "PartsOfSpeech.py", line 71, in <module>
FilteredTrees = pos.run_pos(data.lower())
File "PartsOfSpeech.py", line 59, in run_pos
df = df.apply(pos_tag)
File "/anaconda3/envs/customer_sentiment/lib/python3.6/site-packages/pandas/core/frame.py", line 6487, in apply
return op.get_result()
File "/anaconda3/envs/customer_sentiment/lib/python3.6/site-packages/pandas/core/apply.py", line 151, in get_result
return self.apply_standard()
File "/anaconda3/envs/customer_sentiment/lib/python3.6/site-packages/pandas/core/apply.py", line 257, in apply_standard
self.apply_series_generator()
File "/anaconda3/envs/customer_sentiment/lib/python3.6/site-packages/pandas/core/apply.py", line 286, in apply_series_generator
results[i] = self.f(v)
File "/anaconda3/envs/customer_sentiment/lib/python3.6/site-packages/nltk/tag/__init__.py", line 162, in pos_tag
return _pos_tag(tokens, tagset, tagger, lang)
File "/anaconda3/envs/customer_sentiment/lib/python3.6/site-packages/nltk/tag/__init__.py", line 119, in _pos_tag
tagged_tokens = tagger.tag(tokens)
File "/anaconda3/envs/customer_sentiment/lib/python3.6/site-packages/nltk/tag/perceptron.py", line 157, in tag
context = self.START + [self.normalize(w) for w in tokens] + self.END
File "/anaconda3/envs/customer_sentiment/lib/python3.6/site-packages/nltk/tag/perceptron.py", line 157, in <listcomp>
context = self.START + [self.normalize(w) for w in tokens] + self.END
File "/anaconda3/envs/customer_sentiment/lib/python3.6/site-packages/nltk/tag/perceptron.py", line 242, in normalize
elif word[0].isdigit():
IndexError: ('string index out of range', 'occurred at index 6')
You can use applymap. pos_tag expects a list of tokens, so each cell is wrapped in a one-element list; filling missing values with '' first lets the lambda skip empty cells:
df.fillna('').applymap(lambda x: nltk.pos_tag([x])[0][1] if x != '' else '')
   0   1   2    3   4   5    6
0  NN  NN  VBG  NN  NN  VBD  NNS
1  NN  RB  IN   TO  NN  NN
Note: If your dataframe is large, it will be more efficient to tag each whole sentence once and then convert the tags into a dataframe. The per-cell approach above will be slow on a big dataset.
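A rough sketch of that row-wise approach (the tag_row helper and the padding behaviour are illustrative assumptions, not part of the answer above):
import nltk
import pandas as pd

def tag_row(row):
    # Drop missing/empty cells, tag the remaining tokens with a single
    # pos_tag call, and keep only the tag from each (word, tag) tuple.
    tokens = [w for w in row if isinstance(w, str) and w != '']
    return [tag for _, tag in nltk.pos_tag(tokens)]

tags = df.fillna('').apply(tag_row, axis=1)   # one tagger call per row
tagged_df = pd.DataFrame(tags.tolist())       # shorter rows are padded with None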