
NLTK Brill Tagger Splitting Words


I am using Python 3.4.1 and NLTK 3, and I am trying to use NLTK's Brill tagger.

Here is the training code for the Brill tagger:

import nltk
from nltk.tag.brill import *
import nltk.tag.brill_trainer as bt
from nltk.corpus import brown

Template._cleartemplates()
templates = fntbl37()
tagged_sentences = brown.tagged_sents(categories = 'news')
tagged_sentences = tagged_sentences[:]
tagger = nltk.tag.BigramTagger(tagged_sentences)
tagger = bt.BrillTaggerTrainer(tagger, templates, trace=3)
tagger = tagger.train(tagged_sentences, max_rules=250)
print(tagger.evaluate(brown.tagged_sents(categories='fiction')[:]))
print(tagger.tag("Hi I am Harry Potter."))

However, the output of the last command is:

[('H', 'NN'), ('i', 'NN'), (' ', 'NN'), ('I', 'NN'), (' ', 'NN'), ('a', 'AT'), ('m', 'NN'), (' ', 'NN'), ('H', 'NN'), ('a', 'AT'), ('r', 'NN'), ('r', 'NN'), ('y', 'NN'), (' ', 'NN'), ('P', 'NN'), ('o', 'NN'), ('t', 'NN'), ('t', 'NN'), ('e', 'NN'), ('r', 'NN'), ('.', '.')]

How do I stop it from splitting the words into letters and tagging the letters instead of the words?


Solution

  • The tag() function expects a list of tokens as input. Since you pass it a string, the string is treated as a sequence, and iterating over a string yields its individual characters:

    >>> list("abc")
    ['a', 'b', 'c']
    

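    All you need to do is turn your string into a list of tokens before tagging, for example with NLTK's word_tokenize or simply by splitting on whitespace. Note that a plain split leaves punctuation attached to the preceding word ('Potter.'):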

    >>> import nltk
    >>> nltk.word_tokenize("Hi I am Harry Potter.")
    ['Hi', 'I', 'am', 'Harry', 'Potter', '.']
    >>> "Hi I am Harry Potter.".split(' ')
    ['Hi', 'I', 'am', 'Harry', 'Potter.']
    

    Adding tokenization before tagging gives the following result:

    print(tagger.tag(nltk.word_tokenize("Hi I am Harry Potter.")))
    [('Hi', 'NN'), ('I', 'PPSS'), ('am', 'VB'), ('Harry', 'NN'), ('Potter', 'NN'), ('.', '.')]
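
To guard against passing a raw string by accident, one option is a small wrapper that tokenizes strings before handing them to the tagger. The sketch below is only an illustration: `simple_tokenize`, `tag_text`, and `StubTagger` are hypothetical names, not part of NLTK, and the regex tokenizer is a rough stand-in for `nltk.word_tokenize` so that the example runs without NLTK or its data files. The stub tags everything as 'NN' and only mimics the fact that a tagger iterates over whatever it is given.

```python
import re


def simple_tokenize(text):
    """Split text into word and punctuation tokens.

    A rough, assumption-laden stand-in for nltk.word_tokenize.
    """
    return re.findall(r"\w+|[^\w\s]", text)


def tag_text(tagger, text_or_tokens):
    """Tokenize plain strings before tagging; pass token lists through."""
    if isinstance(text_or_tokens, str):
        tokens = simple_tokenize(text_or_tokens)
    else:
        tokens = list(text_or_tokens)
    return tagger.tag(tokens)


class StubTagger:
    """Hypothetical stand-in for a trained tagger: tags every item 'NN'."""

    def tag(self, tokens):
        # Iterates over its input, so a string is consumed one character
        # at a time, which is the behaviour described in the question.
        return [(token, "NN") for token in tokens]


stub = StubTagger()
sentence = "Hi I am Harry Potter."

chars = stub.tag(sentence)        # raw string: one entry per character
words = tag_text(stub, sentence)  # tokenized first: one entry per token

print(len(chars), chars[:3])
print(words)
```

With a real tagger you would pass the trained Brill tagger in place of `stub` and could swap `simple_tokenize` for `nltk.word_tokenize`.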