Tags: python, string, nlp, nltk, part-of-speech

Why is NLTK's PoS tagger tagging for each letter in a word instead of tagging for each word?


Say I have this sentence: I am a good boy. I want to find the part of speech of each word in the sentence. This is my code:

import nltk
sentence = 'I am a good boy'
for word in sentence:
    print(word)
    print(nltk.pos_tag(word))

But this produces the following output:

I
[('I', 'PRP')]

[(' ', 'NN')]
a
[('a', 'DT')]
m
[('m', 'NN')]

[(' ', 'NN')]
a
[('a', 'DT')]

[(' ', 'NN')]
g
[('g', 'NN')]
o
[('o', 'NN')]
o
[('o', 'NN')]
d
[('d', 'NN')]

[(' ', 'NN')]
b
[('b', 'NN')]
o
[('o', 'NN')]
y
[('y', 'NN')]

So, I tried to do this instead:

sentence = 'I am a good boy'
for word in sentence.split(' '):
    print(word)
    print(nltk.pos_tag(word))

And this produces the following output:

I
[('I', 'PRP')]
am
[('a', 'DT'), ('m', 'NN')]
a
[('a', 'DT')]
good
[('g', 'NN'), ('o', 'MD'), ('o', 'VB'), ('d', 'NN')]
boy
[('b', 'NN'), ('o', 'NN'), ('y', 'NN')]

Why is it finding the PoS for each letter instead of each word? And how do I fix this?


Solution

  • nltk.pos_tag expects a list (or other sequence) of tokens as its argument and tags each element of it. A Python string is itself a sequence of characters, so in your second example each word you pass in is treated as a sequence of letters and every letter is tagged as a separate token, just as every character of the sentence was in the first example. It works when you pass in the whole list you get from splitting the sentence:

    >>> nltk.pos_tag(sentence.split(" "))
    [('I', 'PRP'), ('am', 'VBP'), ('a', 'DT'), ('good', 'JJ'), ('boy', 'NN')]
    

    Per the documentation, you would normally pass in the output of NLTK's tokenizer (for example, nltk.word_tokenize), which is a list of words/tokens.
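To make the cause easier to see without running the real tagger, here is a minimal sketch. `tag_each_item` is a hypothetical stand-in, not part of NLTK: it only imitates the way a tagger walks over whatever sequence it receives. `pos_tag_sentence` is likewise an illustrative helper name; it assumes NLTK is installed and that the tokenizer and tagger data have already been fetched with nltk.download.

```python
def tag_each_item(tokens):
    """Hypothetical stand-in for a tagger (not an NLTK function).

    It emits one (item, tag) pair per element of `tokens`, which is
    enough to show what happens when a plain string is passed in.
    """
    return [(item, "TAG") for item in tokens]


sentence = "I am a good boy"

# A str is a sequence of one-character strings, so a single word
# produces one pair per letter.
per_letter = tag_each_item("am")

# A list of words produces one pair per word.
per_word = tag_each_item(sentence.split())

print(per_letter)
print(per_word)


def pos_tag_sentence(text):
    """Tokenize `text` first, then tag the whole token list in one call.

    Illustrative helper; assumes NLTK and its tokenizer/tagger data
    are available.
    """
    import nltk  # imported lazily so the demo above runs without NLTK

    tokens = nltk.word_tokenize(text)  # list of word tokens
    return nltk.pos_tag(tokens)        # one (word, tag) pair per token
```

Calling `pos_tag_sentence(sentence)` once for the whole sentence, rather than calling the tagger inside a loop over words, also gives the tagger the surrounding words as context.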