I am working with a basic tagger from the NLTK package. I was previously using OpenNLP's tagging system. I am switching because NLTK has more pre-built modules that I could use later on in my project. But the one thing I am now missing is a "Confidence" value given by the tagger.
Originally, with the OpenNLP setup, I got a numerical value (ranging from 0 to 1) that told me how confident the tagger was in its decision (0 being not at all confident and 1 being completely confident). I was wondering if anyone knew of any values in NLTK's tagging system that could work similarly. It doesn't have to be an identical system, but I was hoping for some sort of numerical ranking that would let me easily see whether a given tag is something I should be double-checking or not.
The one thing I do have in NLTK that is similar to the confidence value is an overall accuracy rating for the tagger; however, that is only available with a pre-tagged source, and it covers the whole document rather than individual words.
My thought was that there might be some statistical measure determining which tag is chosen for each word, and if I could get at it, it might serve as a similar metric, but I cannot find anything of the sort.
Thanks!
NLTK taggers do not provide a direct confidence value for each token, but the ClassifierBasedPOSTagger (which trains a Naive Bayes classifier by default) lets you pass a cutoff probability:
from nltk.tag.sequential import ClassifierBasedPOSTagger

# training_sentences is a list of tagged sentences,
# e.g. from nltk.corpus.treebank.tagged_sents()
tagger = ClassifierBasedPOSTagger(train=training_sentences, cutoff_prob=0.95)
The tagger will then return None if the probability of the POS tag is below 95%. I found 0.95 to be a good trade-off between precision and recall (though of course this depends on the needs of your application).