python-3.x pandas nltk pos-tagger

Apply nltk.pos_tag to entire dataframe


I have the following dataframe

   0     1       2      3     4       5        6
0  i  love  eating  spicy  hand  pulled  noodles
1  i  also    like     to  game    alot         

I'd like to apply a function to create a new dataframe in which each word is replaced by its part-of-speech tag.

I'm using nltk.pos_tag, and I tried df.apply(nltk.pos_tag).

My expected output should look like this:

   0    1    2    3    4    5    6
0  NN   NN   VB   JJ   NN   VB   NN
1  NN   DT   NN   NN   VB   DT   

However, I get IndexError: ('string index out of range', 'occurred at index 6')

Also, I understand that nltk.pos_tag returns tuples of the form ('word', 'pos_tag'), so some further manipulation may be required to keep only the tag. Any suggestions on how to do this efficiently?


Traceback:

Traceback (most recent call last):
  File "PartsOfSpeech.py", line 71, in <module>
    FilteredTrees = pos.run_pos(data.lower())
  File "PartsOfSpeech.py", line 59, in run_pos
    df = df.apply(pos_tag)
  File "/anaconda3/envs/customer_sentiment/lib/python3.6/site-packages/pandas/core/frame.py", line 6487, in apply
    return op.get_result()
  File "/anaconda3/envs/customer_sentiment/lib/python3.6/site-packages/pandas/core/apply.py", line 151, in get_result
    return self.apply_standard()
  File "/anaconda3/envs/customer_sentiment/lib/python3.6/site-packages/pandas/core/apply.py", line 257, in apply_standard
    self.apply_series_generator()
  File "/anaconda3/envs/customer_sentiment/lib/python3.6/site-packages/pandas/core/apply.py", line 286, in apply_series_generator
    results[i] = self.f(v)
  File "/anaconda3/envs/customer_sentiment/lib/python3.6/site-packages/nltk/tag/__init__.py", line 162, in pos_tag
    return _pos_tag(tokens, tagset, tagger, lang)
  File "/anaconda3/envs/customer_sentiment/lib/python3.6/site-packages/nltk/tag/__init__.py", line 119, in _pos_tag
    tagged_tokens = tagger.tag(tokens)
  File "/anaconda3/envs/customer_sentiment/lib/python3.6/site-packages/nltk/tag/perceptron.py", line 157, in tag
    context = self.START + [self.normalize(w) for w in tokens] + self.END
  File "/anaconda3/envs/customer_sentiment/lib/python3.6/site-packages/nltk/tag/perceptron.py", line 157, in <listcomp>
    context = self.START + [self.normalize(w) for w in tokens] + self.END
  File "/anaconda3/envs/customer_sentiment/lib/python3.6/site-packages/nltk/tag/perceptron.py", line 242, in normalize
    elif word[0].isdigit():

Solution

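  • You can use applymap, which calls a function on every cell. df.apply passes a whole column to nltk.pos_tag, and column 6 contains an empty cell: the traceback ends at word[0].isdigit(), and word[0] fails on an empty string. Tagging each cell on its own and skipping the empty ones avoids the error. (In pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map.)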

    df.fillna('').applymap(lambda x: nltk.pos_tag([x])[0][1] if x!='' else '')
    
        0   1   2   3   4   5   6
    0   NN  NN  VBG NN  NN  VBD NNS
    1   NN  RB  IN  TO  NN  NN  
    

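    Note: If your dataframe is large, it will be more efficient to tag entire sentences at once and then convert the tags into a dataframe. The per-cell approach above calls nltk.pos_tag once per word and tags each word without its sentence context, so it will be slow on a big dataset.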
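
    The sentence-level approach from the note could look roughly like the sketch below. It is not part of the original answer: tag_rows is a hypothetical helper name, and the sketch assumes each dataframe row is one sentence and that empty cells hold '' or NaN. The tagger is passed in as a callable, so nltk.pos_tag (or any function returning (token, tag) pairs) can be plugged in.

```python
def tag_rows(rows, tagger):
    """Tag each row as one sentence and keep tags aligned with the original cells.

    rows:   list of rows, each a list of cell values (words or empty cells)
    tagger: callable taking a list of tokens and returning (token, tag) pairs,
            for example nltk.pos_tag
    """
    tagged_rows = []
    for row in rows:
        tags = [""] * len(row)
        # Positions of real words; empty strings and non-strings (e.g. NaN) are skipped.
        positions = [j for j, cell in enumerate(row) if isinstance(cell, str) and cell]
        if positions:
            # One tagger call per sentence instead of one per word.
            pairs = tagger([row[j] for j in positions])
            for j, (_, tag) in zip(positions, pairs):
                tags[j] = tag
        tagged_rows.append(tags)
    return tagged_rows


# Usage with the question's dataframe (requires pandas, nltk and its tagger data):
#   import nltk
#   import pandas as pd
#   tags = tag_rows(df.values.tolist(), nltk.pos_tag)
#   tagged_df = pd.DataFrame(tags, index=df.index, columns=df.columns)
```

    Because the tagger now sees each word in context, the tags may differ from the per-cell output shown above.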