How to use the confusion matrix module in NLTK?

I followed the NLTK book in using the confusion matrix but the confusionmatrix looks very odd.

#empirically exam where tagger is making mistakes
test_tags = [tag for sent in brown.sents(categories='editorial')
    for (word, tag) in t2.tag(sent)]
gold_tags = [tag for (word, tag) in brown.tagged_words(categories='editorial')]
print nltk.ConfusionMatrix(gold_tags, test_tags)

Can anyone explain how to use the confusion matrix?

Solution

Firstly, I assume that you got the code from old NLTK's chapter 05: https://nltk.googlecode.com/svn/trunk/doc/book/ch05.py, particularly you're look at this section: http://pastebin.com/EC8fFqLU

Now, let's look at the confusion matrix in NLTK, try:

from nltk.metrics import ConfusionMatrix
ref  = 'DET NN VB DET JJ NN NN IN DET NN'.split()
tagged = 'DET VB VB DET NN NN NN IN DET NN'.split()
cm = ConfusionMatrix(ref, tagged)
print cm

[out]:

    | D         |
    | E I J N V |
    | T N J N B |
----+-----------+
DET |<3>. . . . |
 IN | .<1>. . . |
 JJ | . .<.>1 . |
 NN | . . .<3>1 |
 VB | . . . .<1>|
----+-----------+
(row = reference; col = test)

The numbers embedded in <> are the true positives (tp). And from the example above, you see that one of the JJ from reference was wrongly tagged as NN from the tagged output. For that instance, it counts as one false positive for NN and one false negative for JJ.

To access the confusion matrix (for calculating precision/recall/fscore), you can access the false negatives, false positives and true positives by:

labels = set('DET NN VB IN JJ'.split())

true_positives = Counter()
false_negatives = Counter()
false_positives = Counter()

for i in labels:
    for j in labels:
        if i == j:
            true_positives[i] += cm[i,j]
        else:
            false_negatives[i] += cm[i,j]
            false_positives[j] += cm[i,j]

print "TP:", sum(true_positives.values()), true_positives
print "FN:", sum(false_negatives.values()), false_negatives
print "FP:", sum(false_positives.values()), false_positives

[out]:

TP: 8 Counter({'DET': 3, 'NN': 3, 'VB': 1, 'IN': 1, 'JJ': 0})
FN: 2 Counter({'NN': 1, 'JJ': 1, 'VB': 0, 'DET': 0, 'IN': 0})
FP: 2 Counter({'VB': 1, 'NN': 1, 'DET': 0, 'JJ': 0, 'IN': 0})

To calculate Fscore per label:

for i in sorted(labels):
    if true_positives[i] == 0:
        fscore = 0
    else:
        precision = true_positives[i] / float(true_positives[i]+false_positives[i])
        recall = true_positives[i] / float(true_positives[i]+false_negatives[i])
        fscore = 2 * (precision * recall) / float(precision + recall)
    print i, fscore

[out]:

DET 1.0
IN 1.0
JJ 0
NN 0.75
VB 0.666666666667

I hope the above will de-confuse the confusion matrix usage in NLTK, here's the full code for the example above:

from collections import Counter
from nltk.metrics import ConfusionMatrix

ref  = 'DET NN VB DET JJ NN NN IN DET NN'.split()
tagged = 'DET VB VB DET NN NN NN IN DET NN'.split()
cm = ConfusionMatrix(ref, tagged)

print cm

labels = set('DET NN VB IN JJ'.split())

true_positives = Counter()
false_negatives = Counter()
false_positives = Counter()

for i in labels:
    for j in labels:
        if i == j:
            true_positives[i] += cm[i,j]
        else:
            false_negatives[i] += cm[i,j]
            false_positives[j] += cm[i,j]

print "TP:", sum(true_positives.values()), true_positives
print "FN:", sum(false_negatives.values()), false_negatives
print "FP:", sum(false_positives.values()), false_positives
print 

for i in sorted(labels):
    if true_positives[i] == 0:
        fscore = 0
    else:
        precision = true_positives[i] / float(true_positives[i]+false_positives[i])
        recall = true_positives[i] / float(true_positives[i]+false_negatives[i])
        fscore = 2 * (precision * recall) / float(precision + recall)
    print i, fscore