I am using NLTK's Unigram Tagger and training it on sentences from the Brown Corpus.
I have tried different categories, such as fiction, romance, or humor, and I get about the same value for each of them. The value is around 0.9328.
import nltk
from nltk.corpus import brown
# Fiction
brown_tagged_sents = brown.tagged_sents(categories='fiction')
brown_sents = brown.sents(categories='fiction')
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
unigram_tagger.evaluate(brown_tagged_sents)
# Output: 0.9415956079897209
# Romance
brown_tagged_sents = brown.tagged_sents(categories='romance')
brown_sents = brown.sents(categories='romance')
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
unigram_tagger.evaluate(brown_tagged_sents)
# Output: 0.9348490474422324
Why is that the case? Is it because the categories are from the same corpus, or because their part-of-speech tagging is the same?
It looks like you are training the UnigramTagger and then evaluating it on the same data it was trained on. Take a look at the documentation of nltk.tag, specifically the part about evaluation.
With your code you will always get a high score, because your training data and your evaluation/test data are the same: a unigram tagger assigns each word the tag it saw most often for that word during training, so it has already seen every word it is scored on. If you evaluate on data that differs from the training data, you will get different results. My examples are below:
Category: Fiction
Here the training set is brown.tagged_sents(categories='fiction')[:500]
and the test/evaluation set is brown.tagged_sents(categories='fiction')[501:600]
from nltk.corpus import brown
import nltk
# Fiction
brown_tagged_sents = brown.tagged_sents(categories='fiction')[:500]
brown_sents = brown.sents(categories='fiction')  # unused below; kept from the question's code
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
unigram_tagger.evaluate(brown.tagged_sents(categories='fiction')[501:600])
This gives you a score of ~ 0.7474610697359513
Category: Romance
Here the training set is brown.tagged_sents(categories='romance')[:500]
and the test/evaluation set is brown.tagged_sents(categories='romance')[501:600]
from nltk.corpus import brown
import nltk
# Romance
brown_tagged_sents = brown.tagged_sents(categories='romance')[:500]
brown_sents = brown.sents(categories='romance')  # unused below; kept from the question's code
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
unigram_tagger.evaluate(brown.tagged_sents(categories='romance')[501:600])
This gives you a score of ~ 0.7046799354491662
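The two examples above can be folded into one helper, so that every category is scored on sentences the tagger never saw. This is only a minimal sketch: the names split_train_test and held_out_accuracy are hypothetical, and the 90/10 split is my assumption rather than the fixed slices used above, so the scores will not match the numbers quoted here.

```python
def split_train_test(tagged_sents, train_fraction=0.9):
    """Split tagged sentences into two disjoint parts: (train, test).

    train_fraction is an assumed default, not something NLTK prescribes.
    """
    tagged_sents = list(tagged_sents)
    cut = int(len(tagged_sents) * train_fraction)
    return tagged_sents[:cut], tagged_sents[cut:]


def held_out_accuracy(category, train_fraction=0.9):
    """Train a UnigramTagger on one Brown category and score it on held-out sentences."""
    # Imported here so the splitting helper can be reused without NLTK.
    # Requires NLTK and the Brown corpus, e.g. nltk.download('brown').
    import nltk
    from nltk.corpus import brown

    train, test = split_train_test(
        brown.tagged_sents(categories=category), train_fraction
    )
    tagger = nltk.UnigramTagger(train)
    # evaluate() matches the code above; newer NLTK releases deprecate it
    # in favour of accuracy(), which reports the same measure.
    return tagger.evaluate(test)
```

Calling held_out_accuracy('fiction') and held_out_accuracy('romance') should then give scores noticeably below the ~0.93 you get when the training and test data are identical.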
I hope this helps and answers your question.