python, nlp, nltk, linguistics, pos-tagger

Evaluating POS tagger in NLTK


I want to evaluate different POS taggers in NLTK using a text file as input.

As an example, I will take the Unigram tagger. I have found how to evaluate the Unigram tagger using the Brown corpus.

from nltk.corpus import brown
import nltk

brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')
# We train a UnigramTagger by specifying tagged sentence data as a parameter
# when we initialize the tagger.
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
print(unigram_tagger.tag(brown_sents[2007]))
print(unigram_tagger.evaluate(brown_tagged_sents))

It produces the following output.

[('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'), ('are', 'BER'), ('of', 'IN'), ('the', 'AT'), ('terrace', 'NN'), ('type', 'NN'), (',', ','), ('being', 'BEG'), ('on', 'IN'), ('the', 'AT'), ('ground', 'NN'), ('floor', 'NN'), ('so', 'QL'), ('that', 'CS'), ('entrance', 'NN'), ('is', 'BEZ'), ('direct', 'JJ'), ('.', '.')]
0.9349006503968017

In a similar manner, I want to read text from a text file and evaluate the accuracy of different POS taggers.

I figured out how to read a text file and how to apply POS tags to the tokens.

import nltk
from nltk.corpus import brown
from nltk.corpus import state_union

brown_tagged_sents = brown.tagged_sents(categories='news')

sample_text = state_union.raw(
    r"C:\pythonprojects\tagger_nlt\new-testing.txt")
tokens = nltk.word_tokenize(sample_text)

default_tagger = nltk.UnigramTagger(brown_tagged_sents)

print(default_tagger.tag(tokens))
[('Honestly', None), ('last', 'AP'), ('seven', 'CD'), ('lectures', None), ('are', 'BER'), ('good', 'JJ'), ('.', '.'), ('Lectures', None), ('are', 'BER'), ('understandable', 'JJ')

What I want is a score, like the one returned by default_tagger.evaluate(), so that I can compare different POS taggers in NLTK on the same input file and identify the best-suited tagger for that file.

Any help will be appreciated.


Solution

  • This question is essentially about model evaluation metrics. In this case, our model is a POS tagger, specifically the UnigramTagger.

    Quantifying

    You want to know "how well" your tagger is doing. That is a qualitative question, so we rely on standard quantitative metrics to define what "how well" means: accuracy, precision, recall and f1-score.
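
    For intuition, tag-level accuracy is simply the fraction of tokens whose predicted tag matches the gold tag. A toy illustration (the tag sequences below are made up):

    # made-up gold and predicted tags for a five-token sentence
    gold_tags = ["DET", "NOUN", "VERB", "ADJ", "."]
    pred_tags = ["DET", "NOUN", "VERB", "NOUN", "."]

    # accuracy = number of matching tags / total number of tags
    accuracy = sum(g == p for g, p in zip(gold_tags, pred_tags)) / len(gold_tags)
    print(accuracy)  # 0.8 -- four of the five tags match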

    Evaluating

    First off, we need some data that is marked up with POS tags so that we have something to test against. This is usually referred to as a train/test split: some of the data is used for training the POS tagger, and some is used for testing or evaluating its performance.

    Since POS tagging is traditionally a supervised learning problem, we need some sentences with POS tags to train and test with.

    In practice, people label a bunch of sentences and then split them into a train set and a test set. The NLTK book explains this well. Let's try it out.

    from nltk import UnigramTagger
    from nltk.corpus import brown
    # we'll use the brown corpus with universal tagset for readability
    tagged_sentences = brown.tagged_sents(categories="news", tagset="universal")
    
    # let's keep 20% of the data for testing, and 80% for training
    i = int(len(tagged_sentences)*0.2)
    train_sentences = tagged_sentences[i:]
    test_sentences = tagged_sentences[:i]
    
    # let's train the tagger with our train sentences
    unigram_tagger = UnigramTagger(train_sentences)
    # now let's evaluate with our test sentences
    # default evaluation metric for nltk taggers is accuracy
    accuracy = unigram_tagger.evaluate(test_sentences)
    
    print("Accuracy:", accuracy)
    Accuracy: 0.8630364649525858
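
    (Depending on your NLTK version, evaluate() may print a deprecation warning pointing you to tagger.accuracy() instead; both return the same score.)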
    

    Now, accuracy is an OK metric for knowing "how many you got right", but there are other metrics that give us more detail, such as precision, recall and f1-score. We can use sklearn's classification_report to give us a good overview of the results.

    from sklearn import metrics

    # re-tag the test sentences, stripping the gold tags first
    tagged_test_sentences = unigram_tagger.tag_sents([[token for token, tag in sent] for sent in test_sentences])
    # flatten the gold and predicted tags into two parallel lists
    gold = [str(tag) for sentence in test_sentences for token, tag in sentence]
    pred = [str(tag) for sentence in tagged_test_sentences for token, tag in sentence]
    print(metrics.classification_report(gold, pred))
    
                 precision    recall  f1-score   support
    
              .       1.00      1.00      1.00      2107
            ADJ       0.89      0.79      0.84      1341
            ADP       0.97      0.92      0.94      2621
            ADV       0.93      0.79      0.86       573
           CONJ       1.00      1.00      1.00       453
            DET       1.00      0.99      1.00      2456
           NOUN       0.96      0.76      0.85      6265
            NUM       0.99      0.85      0.92       379
           None       0.00      0.00      0.00         0
           PRON       1.00      0.96      0.98       502
            PRT       0.69      0.96      0.80       481
           VERB       0.96      0.83      0.89      3274
              X       0.10      0.17      0.12         6
    
    avg / total       0.96      0.86      0.91     20458
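
    Since the original goal was to compare different taggers on the same data, the same held-out test set and metric can simply be reused for every candidate. A minimal sketch along those lines (the particular taggers below are only examples):

    from nltk import DefaultTagger, UnigramTagger, BigramTagger

    # a few candidate taggers, all trained on the same train_sentences
    # (DefaultTagger assigns one fixed tag, here "NOUN", to every token)
    candidates = {
        "default": DefaultTagger("NOUN"),
        "unigram": UnigramTagger(train_sentences),
        "bigram with unigram backoff": BigramTagger(
            train_sentences, backoff=UnigramTagger(train_sentences)),
    }

    # score every candidate against the same test_sentences
    for name, tagger in candidates.items():
        print(name, tagger.evaluate(test_sentences))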
    

    Now we have some concrete numbers we can use to quantify our taggers, but I am sure you are thinking, "That's all well and good, but how well does it perform on random sentences?"

    Simply put, as mentioned in other answers: unless you have your own POS-tagged data for the sentences you want to test on, you will never know for sure!
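
    If you do put together such a file, one simple option is to store it in the word/TAG format that NLTK's TaggedCorpusReader understands, with one tagged sentence per line, and then score any tagger against it. A rough sketch, assuming a hypothetical file my_tagged_sentences.txt containing lines like The/DET lectures/NOUN are/VERB good/ADJ ./. :

    from nltk.corpus.reader import TaggedCorpusReader

    # read a hand-tagged file in word/TAG format
    # (the directory and file name are placeholders -- point them at your own data)
    reader = TaggedCorpusReader(".", "my_tagged_sentences.txt", sep="/")
    my_test_sentences = reader.tagged_sents()

    # any trained tagger can now be scored on your own sentences
    print(unigram_tagger.evaluate(my_test_sentences))

    For the score to be meaningful, the tags in your file have to come from the same tagset the tagger was trained on (the universal tagset in the examples above).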