
Obtain multiple taggings with Stanford POS Tagger


I'm performing POS tagging with the Stanford POS Tagger. The tagger only returns one possible tagging for the input sentence. For instance, given the input sentence "The clown weeps.", the tagger produces the erroneous tagging "The_DT clown_NN weeps_NNS ._.".

However, my application will try to parse the result, and may reject a POS tagging because there is no way to parse it. In this example, it would reject "The_DT clown_NN weeps_NNS ._." but would accept "The_DT clown_NN weeps_VBZ ._.", which I assume the tagger considers a lower-confidence tagging.

I would therefore like the POS tagger to provide multiple hypotheses for the tagging of each word, annotated with some kind of confidence value. That way, my application could choose the highest-confidence POS tagging that yields a valid parse for its purposes.

I have found no way to ask the Stanford POS Tagger to produce multiple (n-best) tagging hypotheses, for each word or even for the whole sentence. Is there a way to do this? (Alternatively, I would also be happy to use another POS tagger with comparable performance that supports this.)


Solution

  • OpenNLP supports retrieving the n-best tag sequences for POS tagging:

    Some applications need to retrieve the n-best POS tag sequences, not only the best sequence. The topKSequences method returns the top sequences and can be called in a similar way to tag:

    Sequence[] topSequences = tagger.topKSequences(sent);
    

    Each Sequence object contains one tag sequence. The tags can be retrieved via Sequence.getOutcomes(), which returns them as a list, and Sequence.getProbs() returns the corresponding per-token probability array for the sequence.
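
    To make this concrete, here is a minimal end-to-end sketch against the OpenNLP API, assuming the pre-trained en-pos-maxent.bin model file has been downloaded from the OpenNLP model repository. The isAcceptedByParser method is a hypothetical stand-in for the application's own parse check: the loop prints each candidate sequence with its per-token probabilities and keeps the first one the parser accepts.

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.List;

    import opennlp.tools.postag.POSModel;
    import opennlp.tools.postag.POSTaggerME;
    import opennlp.tools.util.Sequence;

    public class NBestTaggingDemo {

        public static void main(String[] args) throws IOException {
            // Load a pre-trained POS model (the file name is an assumption;
            // the standard English maxent model is distributed as en-pos-maxent.bin).
            try (InputStream modelIn = new FileInputStream("en-pos-maxent.bin")) {
                POSTaggerME tagger = new POSTaggerME(new POSModel(modelIn));

                String[] sent = {"The", "clown", "weeps", "."};

                // The n-best tag sequences for the sentence, best first.
                Sequence[] topSequences = tagger.topKSequences(sent);

                for (Sequence seq : topSequences) {
                    List<String> tags = seq.getOutcomes(); // one tag per token
                    double[] probs = seq.getProbs();       // per-token probabilities

                    for (int i = 0; i < sent.length; i++) {
                        System.out.printf("%s_%s (%.3f) ", sent[i], tags.get(i), probs[i]);
                    }
                    System.out.println();

                    // Accept the first (most confident) tagging the parser can handle.
                    if (isAcceptedByParser(sent, tags)) {
                        break;
                    }
                }
            }
        }

        // Placeholder for the application's own parse check.
        private static boolean isAcceptedByParser(String[] tokens, List<String> tags) {
            return true;
        }
    }

    Sequence also carries an overall score via getScore(), which can serve as a single per-sequence confidence value instead of the per-token probabilities.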

    There is also a way to make spaCy do something like this:

    import numpy
    from spacy.language import Language
    from spacy.pipeline import Tagger
    from spacy.tokens import Doc, Token

    # Store the full tag-probability matrix on the Doc, and expose each
    # token's row of probabilities through a Token extension.
    Doc.set_extension('tag_scores', default=None)
    Token.set_extension('tag_scores', getter=lambda token: token.doc._.tag_scores[token.i])

    class ProbabilityTagger(Tagger):
        def predict(self, docs):
            tokvecs = self.model.tok2vec(docs)
            scores = self.model.softmax(tokvecs)
            guesses = []
            for i, doc_scores in enumerate(scores):
                # Keep the softmax distribution over tags for every token.
                docs[i]._.tag_scores = doc_scores
                doc_guesses = doc_scores.argmax(axis=1)

                # On GPU the argmax result is a cupy array; copy it to the host.
                if not isinstance(doc_guesses, numpy.ndarray):
                    doc_guesses = doc_guesses.get()
                guesses.append(doc_guesses)
            return guesses, tokvecs

    Language.factories['tagger'] = lambda nlp, **cfg: ProbabilityTagger(nlp.vocab, **cfg)
    

    Then each token will have tag_scores with the probabilities for each part of speech from spaCy's tag map. Note that this snippet targets spaCy v2's pipeline API (Language.factories); spaCy v3 registers custom components differently.

    Source: https://github.com/explosion/spaCy/issues/2087