nlp, nltk, stanford-nlp, allennlp

Unigram tagging in NLTK


I am training NLTK's UnigramTagger on sentences from the Brown Corpus.

I tried different categories and got about the same value each time. The value is around 0.9328... for each category, such as fiction, romance, or humor.

from nltk.corpus import brown
import nltk

# Fiction
brown_tagged_sents = brown.tagged_sents(categories='fiction')
brown_sents = brown.sents(categories='fiction')
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
unigram_tagger.evaluate(brown_tagged_sents)
>>> 0.9415956079897209

# Romance
brown_tagged_sents = brown.tagged_sents(categories='romance')
brown_sents = brown.sents(categories='romance')
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
unigram_tagger.evaluate(brown_tagged_sents)
>>> 0.9348490474422324

Why is that the case? Is it because the categories come from the same corpus, or because their part-of-speech tagging is the same?


Solution

  • It looks like you are training and then evaluating the trained UnigramTagger on the same training data. Take a look at the documentation of nltk.tag and specifically the part about evaluation.

    With your code, you get a high score, which is expected because your training data and your evaluation/testing data are the same. If you evaluate on data that is different from the training data, you will get different results. My examples are below:

    Category: Fiction

    Here the training set is brown.tagged_sents(categories='fiction')[:500] and the test/evaluation set is brown.tagged_sents(categories='fiction')[501:600].

    from nltk.corpus import brown
    import nltk
    
    # Fiction    
    brown_tagged_sents = brown.tagged_sents(categories='fiction')[:500]
    brown_sents = brown.sents(categories='fiction')  # unused; carried over from the question
    unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
    unigram_tagger.evaluate(brown.tagged_sents(categories='fiction')[501:600])
    

    This gives you a score of ~ 0.7474610697359513

    Category: Romance

    Here the training set is brown.tagged_sents(categories='romance')[:500] and the test/evaluation set is brown.tagged_sents(categories='romance')[501:600].

    from nltk.corpus import brown
    import nltk
    
    # Romance
    brown_tagged_sents = brown.tagged_sents(categories='romance')[:500]
    brown_sents = brown.sents(categories='romance')  # unused; carried over from the question
    unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
    unigram_tagger.evaluate(brown.tagged_sents(categories='romance')[501:600])
    

    This gives you a score of ~ 0.7046799354491662

    I hope this helps and answers your question.
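
The fixed slices above can be generalized into a proportional split, so that each category is trained on one part of its sentences and scored on the rest. The following is only a minimal sketch under some assumptions: the helper names split_train_test and held_out_accuracy and the train_fraction parameter are made up for illustration, and the 90/10 split is an arbitrary choice. As far as I know, newer NLTK releases offer accuracy() in place of evaluate(), so the sketch uses whichever one the tagger has.

```python
def split_train_test(sentences, train_fraction=0.9):
    """Split a sequence of tagged sentences into disjoint train and test parts.

    Hypothetical helper: the name and the default fraction are illustrative.
    """
    sentences = list(sentences)
    cut = int(len(sentences) * train_fraction)
    return sentences[:cut], sentences[cut:]


def held_out_accuracy(category, train_fraction=0.9):
    """Train a UnigramTagger on part of a Brown category and score it on the rest.

    Hypothetical helper. Assumes NLTK is installed and the Brown corpus has
    been fetched once with nltk.download('brown').
    """
    # Imported here so that split_train_test can be used without NLTK.
    import nltk
    from nltk.corpus import brown

    tagged = brown.tagged_sents(categories=category)
    train, test = split_train_test(tagged, train_fraction)
    tagger = nltk.UnigramTagger(train)

    # Use accuracy() when the installed NLTK has it, otherwise fall back
    # to the older evaluate() used in the question.
    score = tagger.accuracy if hasattr(tagger, "accuracy") else tagger.evaluate
    return score(test)
```

Calling held_out_accuracy('fiction') and held_out_accuracy('romance') should then give scores measured on sentences the tagger has not seen during training. The exact values depend on the split, so they will not match the numbers quoted above.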