
Can NLTK's pos tagger recognize contractions correctly?


I want to know if I need to write a de-contraction function before sending a given text to NLTK's pos tagger. I am reluctant to word-tokenize the text because contractions end up split into pieces (e.g. don't → "do", "n't"), which I suspect would make pos tagging more difficult.

In short, my questions are:

  • Does NLTK's pos tagger recognize most contractions? (From my limited experience it seems to work well without word tokenization.)
  • Will word tokenization (as opposed to simple word splitting) improve or impair the process?
  • Would it just be easier for me to write a de-contraction function?
  • Are there any other pos taggers that recognize contractions?

example_text = "I can't and I won't go to the park because I don't like grass."
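
To clarify, by a de-contraction function I mean a hand-rolled sketch along these lines (the mapping is illustrative and far from complete, and the helper is my own, not an existing library function):

    import re

    # Minimal, incomplete contraction mapping, purely for illustration.
    CONTRACTIONS = {
        "can't": "cannot",
        "won't": "will not",
        "don't": "do not",
    }

    def decontract(text):
        # Replace each known contraction with its expanded form (case-sensitive).
        pattern = re.compile("|".join(re.escape(c) for c in CONTRACTIONS))
        return pattern.sub(lambda m: CONTRACTIONS[m.group(0)], text)

    # decontract(example_text)
    # -> "I cannot and I will not go to the park because I do not like grass."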


Solution

  • I want to know if I need to write a de-contraction function before sending a given text to NLTK's pos tagger.

    You do not. The default NLTK tagger is trained on text tokenized with the default NLTK tokenizer, and works correctly with text that is tokenized the same way; anything else would be a bug in NLTK. So if you change the tokenizer, you will make performance worse, not better.
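
    For instance, the default tokenizer splits a contraction into two tokens, which is exactly the shape the tagger was trained on:

    >>> nltk.word_tokenize("don't")
    ['do', "n't"]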

    If you try your own example you'll see that it correctly tags "ca" and "wo" as MD (modal verb), even though there are no such words in English; I don't particularly like it (why not just tokenize "can't" as "can n't"?), but the tagger certainly knows what to do with it.

    >>> import nltk
    >>> nltk.pos_tag(nltk.word_tokenize(example_text))
    [('I', 'PRP'), ('ca', 'MD'), ("n't", 'RB'), ('and', 'CC'), ('I', 'PRP'),
     ('wo', 'MD'), ("n't", 'RB'), ('go', 'VB'), ('to', 'TO'), ('the', 'DT'),
     ('park', 'NN'), ('because', 'IN'), ('I', 'PRP'), ('do', 'VBP'), ("n't", 'RB'),
     ('like', 'VB'), ('grass', 'NN'), ('.', '.')]


    Will the tagger get some things wrong? Definitely. No tagger is perfect. But if you want better performance, you need to find or train a better tagger; you can't "improve" things by swapping out the word tokenizer that the tagger was designed to work with.
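
    If you do decide to train your own, a minimal sketch looks something like this, using the Penn Treebank sample that ships with NLTK (whether this simple backoff chain actually beats the default tagger on your data is another question):

    import nltk
    from nltk.corpus import treebank  # needs nltk.download('treebank') once

    # Simple backoff chain: bigram -> unigram -> default "NN".
    train_sents = treebank.tagged_sents()
    default = nltk.DefaultTagger('NN')
    unigram = nltk.UnigramTagger(train_sents, backoff=default)
    bigram = nltk.BigramTagger(train_sents, backoff=unigram)

    # Tag text tokenized with the same tokenizer as before.
    bigram.tag(nltk.word_tokenize(example_text))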

    PS. You should only pass one (tokenized) sentence at a time to the tagger. If you pass it your entire file as one long list of words, you lose performance unnecessarily. This is how you should do it:

    # Split into sentences first, then tokenize and tag each sentence.
    sents = [nltk.word_tokenize(s) for s in nltk.sent_tokenize(long_text)]
    nltk.pos_tag_sents(sents)
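
    Note that pos_tag_sents returns one list of (word, tag) pairs per sentence; if you want a single flat list of tagged tokens, flatten the result yourself:

    tagged = [pair for sent in nltk.pos_tag_sents(sents) for pair in sent]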