I am using Python 3.4.1 and NLTK 3, and I am trying to use NLTK's Brill tagger.
Here is the training code for the brill tagger:
import nltk
from nltk.tag.brill import *
import nltk.tag.brill_trainer as bt
from nltk.corpus import brown
Template._cleartemplates()
templates = fntbl37()
tagged_sentences = brown.tagged_sents(categories = 'news')
tagged_sentences = tagged_sentences[:]
tagger = nltk.tag.BigramTagger(tagged_sentences)
tagger = bt.BrillTaggerTrainer(tagger, templates, trace=3)
tagger = tagger.train(tagged_sentences, max_rules=250)
print(tagger.evaluate(brown.tagged_sents(categories='fiction')[:]))
print(tagger.tag("Hi I am Harry Potter."))
The output of the last command, however, is:
[('H', 'NN'), ('i', 'NN'), (' ', 'NN'), ('I', 'NN'), (' ', 'NN'), ('a', 'AT'), ('m', 'NN'), (' ', 'NN'), ('H', 'NN'), ('a', 'AT'), ('r', 'NN'), ('r', 'NN'), ('y', 'NN'), (' ', 'NN'), ('P', 'NN'), ('o', 'NN'), ('t', 'NN'), ('t', 'NN'), ('e', 'NN'), ('r', 'NN'), ('.', '.')]
How do I stop it from splitting the words into letters and tagging the letters instead of the word?
The tag() function expects a list of tokens as input.
Since you give it a string as input, the string is iterated over as if it were a list, so each character is treated as a token.
Turning a string into a list gives you a list of characters:
>>> list("abc")
['a', 'b', 'c']
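All you need to do is turn your string into a list of tokens before tagging, for example with nltk.word_tokenize() or simply by splitting on whitespace (note that plain splitting leaves punctuation attached to the word, as in 'Potter.'):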
>>> import nltk
>>> nltk.word_tokenize("Hi I am Harry Potter.")
['Hi', 'I', 'am', 'Harry', 'Potter', '.']
>>> "Hi I am Harry Potter.".split(' ')
['Hi', 'I', 'am', 'Harry', 'Potter.']
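One caveat about splitting by hand, shown here as a small sketch: split(' ') treats every single space as a separator, while split() with no argument splits on runs of whitespace. The sample text with a doubled space is made up for illustration.

```python
text = "Hi  I am Harry Potter."  # note the two spaces after "Hi"

# split(' ') treats each single space as a separator, so the
# doubled space produces an empty-string token.
by_space = text.split(' ')

# split() with no argument splits on runs of any whitespace
# and never produces empty tokens.
by_whitespace = text.split()

print(by_space)
print(by_whitespace)
```

An empty-string token would be tagged like any other token, so split() without an argument is usually the safer of the two if you are not using a real tokenizer.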
Adding tokenization before tagging gives the following result:
print(tagger.tag(nltk.word_tokenize("Hi I am Harry Potter.")))
[('Hi', 'NN'), ('I', 'PPSS'), ('am', 'VB'), ('Harry', 'NN'), ('Potter', 'NN'), ('.', '.')]
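If you want to guard against this mistake in general, a small wrapper can tokenize raw strings before they reach the tagger. The following is only a sketch: tag_text, simple_tokenize and EchoTagger are hypothetical names, not part of NLTK, and the regex tokenizer is a rough stand-in so the example runs without NLTK data. With a real tagger you would pass nltk.word_tokenize as the tokenize argument.

```python
import re


def simple_tokenize(text):
    """Rough stand-in tokenizer: words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def tag_text(tagger, text_or_tokens, tokenize=simple_tokenize):
    """Tokenize raw strings before tagging; pass token lists through."""
    if isinstance(text_or_tokens, str):
        tokens = tokenize(text_or_tokens)
    else:
        tokens = list(text_or_tokens)
    return tagger.tag(tokens)


class EchoTagger:
    """Hypothetical dummy tagger: tags every token as 'NN'."""

    def tag(self, tokens):
        return [(token, "NN") for token in tokens]


print(tag_text(EchoTagger(), "Hi I am Harry Potter."))
print(tag_text(EchoTagger(), ["Hi", "I", "am"]))
```

Any object with a tag(tokens) method works in place of EchoTagger, including the trained Brill tagger from the question.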