I want to evaluate different POS taggers in NLTK using a text file as input.
As an example, I will take the Unigram tagger. I have found how to evaluate a UnigramTagger using the Brown corpus.
from nltk.corpus import brown
import nltk
brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')
# We train a UnigramTagger by specifying tagged sentence data as a parameter
# when we initialize the tagger.
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
print(unigram_tagger.tag(brown_sents[2007]))
print(unigram_tagger.evaluate(brown_tagged_sents))
It produces output like the following.
[('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'), ('are', 'BER'), ('of', 'IN'), ('the', 'AT'), ('terrace', 'NN'), ('type', 'NN'), (',', ','), ('being', 'BEG'), ('on', 'IN'), ('the', 'AT'), ('ground', 'NN'), ('floor', 'NN'), ('so', 'QL'), ('that', 'CS'), ('entrance', 'NN'), ('is', 'BEZ'), ('direct', 'JJ'), ('.', '.')]
0.9349006503968017
In a similar manner, I want to read text from a text file and evaluate the accuracy of different POS taggers.
I figured out how to read a text file and how to apply POS tags to the tokens.
import nltk
from nltk.corpus import brown
from nltk.corpus import state_union
brown_tagged_sents = brown.tagged_sents(categories='news')
sample_text = state_union.raw(
r"C:\pythonprojects\tagger_nlt\new-testing.txt")
tokens = nltk.word_tokenize(sample_text)
default_tagger = nltk.UnigramTagger(brown_tagged_sents)
print(default_tagger.tag(tokens))
[('Honestly', None), ('last', 'AP'), ('seven', 'CD'), ('lectures', None), ('are', 'BER'), ('good', 'JJ'), ('.', '.'), ('Lectures', None), ('are', 'BER'), ('understandable', 'JJ')
What I want is a score like default_tagger.evaluate(), so that I can compare different POS taggers in NLTK on the same input file and identify the POS tagger best suited to a given file.
Any help will be appreciated.
This question is essentially about model evaluation metrics. In this case, our model is a POS tagger, specifically the UnigramTagger.
You want to know "how well" your tagger is doing. This is a qualitative question, so we have some general quantitative metrics to help define what "how well" means. Basically, we have standard metrics to give us this information. They are usually accuracy, precision, recall and f1-score.
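For intuition, here is a tiny sketch of what these numbers mean; the gold and predicted tag lists below are made up purely for illustration, not taken from any real tagger.
# Toy example: gold (correct) tags vs. tags a hypothetical tagger predicted.
gold = ['NOUN', 'VERB', 'NOUN', 'ADJ', 'NOUN']
pred = ['NOUN', 'VERB', 'ADJ', 'ADJ', 'NOUN']
# Accuracy: fraction of tokens tagged correctly.
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)   # 4/5 = 0.8
# Precision/recall for one tag, e.g. 'ADJ':
#   precision = of the tokens we labelled ADJ, how many really are ADJ?
#   recall    = of the tokens that really are ADJ, how many did we label ADJ?
tp = sum(g == p == 'ADJ' for g, p in zip(gold, pred))            # 1
precision = tp / pred.count('ADJ')                               # 1/2 = 0.5
recall = tp / gold.count('ADJ')                                  # 1/1 = 1.0
f1 = 2 * precision * recall / (precision + recall)               # about 0.67
print(accuracy, precision, recall, f1)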
First off, we need some data that is marked up with POS tags; then we can test. This is usually referred to as a train/test split, since some of the data is used for training the POS tagger and some is used for testing or evaluating its performance.
Since POS tagging is traditionally a supervised learning problem, we need some sentences with POS tags to train and test with.
In practice, people label a bunch of sentences and then split them into a test set and a train set. The NLTK book explains this well; let's try it out.
from nltk import UnigramTagger
from nltk.corpus import brown
# we'll use the brown corpus with universal tagset for readability
tagged_sentences = brown.tagged_sents(categories="news", tagset="universal")
# let's keep 20% of the data for testing, and 80% for training
i = int(len(tagged_sentences)*0.2)
train_sentences = tagged_sentences[i:]
test_sentences = tagged_sentences[:i]
# let's train the tagger with our training sentences
unigram_tagger = UnigramTagger(train_sentences)
# now let's evaluate with our test sentences
# default evaluation metric for nltk taggers is accuracy
accuracy = unigram_tagger.evaluate(test_sentences)
print("Accuracy:", accuracy)
Accuracy: 0.8630364649525858
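Since the goal is to compare several NLTK taggers on the same data, a minimal sketch (reusing the train/test split above) could train a DefaultTagger, a UnigramTagger and a BigramTagger and score each one on the same test set; the backoff chain and the 'NOUN' default tag here are just illustrative choices.
from nltk import DefaultTagger, UnigramTagger, BigramTagger
# A few taggers trained on the same sentences; the backoff chain is illustrative.
baseline = DefaultTagger('NOUN')  # tags every token as NOUN
unigram_bo = UnigramTagger(train_sentences, backoff=baseline)
bigram_bo = BigramTagger(train_sentences, backoff=unigram_bo)
# Evaluate each one on the same held-out test sentences.
for name, tagger in [('baseline', baseline),
                     ('unigram+backoff', unigram_bo),
                     ('bigram+backoff', bigram_bo)]:
    print(name, tagger.evaluate(test_sentences))
Whichever tagger scores highest on the held-out sentences is the best fit for data that looks like the test set.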
Now, accuracy is an OK metric for knowing "how many you got right", but there are other metrics that give us more detail, such as precision, recall and f1-score. We can use sklearn's classification_report to give us a good overview of the results.
# Strip the gold tags from the test sentences, re-tag the raw tokens,
# then flatten both the gold and predicted tags into token-level lists.
tagged_test_sentences = unigram_tagger.tag_sents([[token for token, tag in sent] for sent in test_sentences])
gold = [str(tag) for sentence in test_sentences for token, tag in sentence]
pred = [str(tag) for sentence in tagged_test_sentences for token, tag in sentence]
from sklearn import metrics
print(metrics.classification_report(gold, pred))
             precision    recall  f1-score   support

          .       1.00      1.00      1.00      2107
        ADJ       0.89      0.79      0.84      1341
        ADP       0.97      0.92      0.94      2621
        ADV       0.93      0.79      0.86       573
       CONJ       1.00      1.00      1.00       453
        DET       1.00      0.99      1.00      2456
       NOUN       0.96      0.76      0.85      6265
        NUM       0.99      0.85      0.92       379
       None       0.00      0.00      0.00         0
       PRON       1.00      0.96      0.98       502
        PRT       0.69      0.96      0.80       481
       VERB       0.96      0.83      0.89      3274
          X       0.10      0.17      0.12         6

avg / total       0.96      0.86      0.91     20458
Now we have some ideas and values we can look at to quantify our taggers, but I am sure you are thinking, "That's all well and good, but how well does it perform on random sentences?"
Simply put, it is what was mentioned in other answers: unless you have your own POS-tagged data for the sentences you want to test, you will never know for sure!
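If you do prepare such a file yourself, for example with each token written as word/TAG, a minimal sketch using NLTK's TaggedCorpusReader could look like this; the directory, file name and tag format below are assumptions for illustration, not something taken from your setup.
from nltk.corpus.reader import TaggedCorpusReader
# Hypothetical gold file, one tagged token per word, e.g.:
#   Honestly/ADV last/ADJ seven/NUM lectures/NOUN are/VERB good/ADJ ./.
reader = TaggedCorpusReader(r"C:\pythonprojects\tagger_nlt", ["my-gold-file.txt"])
my_test_sentences = reader.tagged_sents()
# Now any tagger can be scored against your own data.
print(unigram_tagger.evaluate(my_test_sentences))
Whatever gold file you use, its tags should come from the same tagset the tagger was trained on (the universal tagset in the example above); otherwise every prediction will count as wrong.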