machine-learning, nltk, stanford-nlp, text-classification, naivebayes

How to train a Naive Bayes classifier with a POS-tag sequence as a feature?


I have two classes of sentences. Each has a reasonably distinct POS-tag sequence. How can I train a Naive Bayes classifier with the POS-tag sequence as a feature? Does Stanford CoreNLP/NLTK (Java or Python) provide any method for building a classifier with POS tags as features? I know that in Python, NaiveBayesClassifier allows building an NB classifier, but it uses contains-a-word as the feature. Can it be extended to use a POS-tag sequence as a feature?


Solution

  • If you know how to train and predict texts (or sentences, in your case) using NLTK's Naive Bayes classifier with words as features, then you can easily extend this approach to classify texts by POS tags. This is because the classifier doesn't care whether your feature strings are words or tags. So you can simply replace the words of your sentences with POS tags, for example using NLTK's standard POS tagger:

    import nltk

    # Replace each word with its POS tag
    sent = ['So', 'they', 'have', 'internet', 'on', 'computers', 'now']
    tags = [t for w, t in nltk.pos_tag(sent)]
    print(tags)
    

    ['IN', 'PRP', 'VBP', 'JJ', 'IN', 'NNS', 'RB']

    From here on you can proceed with the "contains-a-word" approach, treating the tags as if they were words.
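
    To make that concrete, here is a minimal sketch of the contains-a-tag variant, assuming a small hypothetical labeled dataset (the sentence lists and class labels below are placeholders you would replace with your own data):

    import nltk

    # Hypothetical labeled data: two classes of tokenized sentences.
    labeled_sentences = [
        (['So', 'they', 'have', 'internet', 'on', 'computers', 'now'], 'classA'),
        (['Please', 'submit', 'the', 'report', 'by', 'Friday'], 'classB'),
    ]

    def pos_features(sentence):
        # Tag the sentence and build "contains this tag" boolean features,
        # mirroring NLTK's usual contains-a-word feature dicts.
        tags = [t for _, t in nltk.pos_tag(sentence)]
        return {'contains({})'.format(tag): True for tag in tags}

    # Train NLTK's Naive Bayes classifier on the tag-based feature sets.
    train_set = [(pos_features(sent), label) for sent, label in labeled_sentences]
    classifier = nltk.NaiveBayesClassifier.train(train_set)

    # Classify a new sentence by its POS-tag features.
    print(classifier.classify(pos_features(['They', 'watch', 'videos', 'online'])))

    Note that this bag-of-tags representation ignores the order of the tags; it only records which tags occur in each sentence, exactly as the contains-a-word approach does for words.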