NLTK Brill Tagger Splitting Words

I am using Python 3.4.1 and NLTK 3, and I am trying to use NLTK's Brill tagger.

Here is the training code for the Brill tagger:

import nltk
from nltk.tag.brill import *
import nltk.tag.brill_trainer as bt
from nltk.corpus import brown

Template._cleartemplates()
templates = fntbl37()
tagged_sentences = brown.tagged_sents(categories = 'news')
tagged_sentences = tagged_sentences[:]
tagger = nltk.tag.BigramTagger(tagged_sentences)
tagger = bt.BrillTaggerTrainer(tagger, templates, trace=3)
tagger = tagger.train(tagged_sentences, max_rules=250)
print(tagger.evaluate(brown.tagged_sents(categories='fiction')[:]))
print(tagger.tag("Hi I am Harry Potter."))

However, the output of the last command is:

[('H', 'NN'), ('i', 'NN'), (' ', 'NN'), ('I', 'NN'), (' ', 'NN'), ('a', 'AT'), ('m', 'NN'), (' ', 'NN'), ('H', 'NN'), ('a', 'AT'), ('r', 'NN'), ('r', 'NN'), ('y', 'NN'), (' ', 'NN'), ('P', 'NN'), ('o', 'NN'), ('t', 'NN'), ('t', 'NN'), ('e', 'NN'), ('r', 'NN'), ('.', '.')]

How do I stop it from splitting the words into letters and tagging the letters instead of the word?

Solution

The tag() function expects a list of tokens as input. Because you pass it a string, the string is treated as a sequence, and iterating over a string yields its individual characters. Converting a string into a list shows the same effect:

>>> list("abc")
['a', 'b', 'c']
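
The same effect can be reproduced without NLTK. The following is a minimal sketch, assuming a hypothetical toy_tag function that, like a tagger, simply labels every item it iterates over; it is not NLTK's implementation:

```python
def toy_tag(tokens):
    """Hypothetical stand-in for a tagger: label each item of the input 'NN'."""
    return [(token, "NN") for token in tokens]


# A string is iterated character by character, spaces included.
from_string = toy_tag("Hi I")

# A list of tokens is iterated token by token.
from_tokens = toy_tag(["Hi", "I"])

print(from_string)  # [('H', 'NN'), ('i', 'NN'), (' ', 'NN'), ('I', 'NN')]
print(from_tokens)  # [('Hi', 'NN'), ('I', 'NN')]
```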

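All you need to do is turn your string into a list of tokens before tagging, for example with NLTK's word_tokenize or simply by splitting on whitespace. Note that a plain split leaves punctuation attached to the preceding word ('Potter.'), whereas word_tokenize separates it into its own token: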

>>> import nltk
>>> nltk.word_tokenize("Hi I am Harry Potter.")
['Hi', 'I', 'am', 'Harry', 'Potter', '.']
>>> "Hi I am Harry Potter.".split(' ')
['Hi', 'I', 'am', 'Harry', 'Potter.']

Tokenizing the input before tagging gives the following result:

>>> print(tagger.tag(nltk.word_tokenize("Hi I am Harry Potter.")))
[('Hi', 'NN'), ('I', 'PPSS'), ('am', 'VB'), ('Harry', 'NN'), ('Potter', 'NN'), ('.', '.')]
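
To guard against passing a raw string by accident, one option is a small wrapper that tokenizes strings and passes token lists through unchanged. This is only a sketch: tag_text, simple_tokenize, and EchoTagger are hypothetical names introduced for illustration, and the regex tokenizer is a rough stand-in for nltk.word_tokenize so that the example runs without NLTK installed.

```python
import re


def simple_tokenize(text):
    """Rough stand-in for nltk.word_tokenize: runs of word characters,
    plus each punctuation mark as its own token."""
    return re.findall(r"\w+|[^\w\s]", text)


def tag_text(tagger, text_or_tokens):
    """Tokenize plain strings before tagging; pass token sequences through."""
    if isinstance(text_or_tokens, str):
        tokens = simple_tokenize(text_or_tokens)
    else:
        tokens = list(text_or_tokens)
    return tagger.tag(tokens)


class EchoTagger:
    """Hypothetical tagger used only for this demo; tags everything 'NN'."""

    def tag(self, tokens):
        return [(token, "NN") for token in tokens]


result = tag_text(EchoTagger(), "Hi I am Harry Potter.")
print(result)
```

With a real trained tagger, the same wrapper could be called as tag_text(tagger, "Hi I am Harry Potter."), and simple_tokenize could be swapped for nltk.word_tokenize.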