machine-learning nlp stanford-nlp pos-tagger

What does k-fold cross-validation mean in the context of POS tagging?

I know that for k-fold cross-validation, I'm supposed to divide the corpus into k equal parts. Of these k parts, I'm to use k-1 parts for training and the remaining part for testing. This process is repeated k times, so that each part is used exactly once for testing.

But I don't understand what exactly training means and what exactly testing means.

What I think is (please correct me if I'm wrong):
1. Training sets (k-1 out of k): These sets are used to build the tag transition probability and emission probability tables. Some tagging algorithm (e.g. the Viterbi algorithm) is then applied using these probability tables.
2. Test set (1 set): Use the remaining set to validate the implementation from step 1. That is, this set from the corpus will have untagged words, and I should run the step 1 implementation on it.

Is my understanding correct? Please explain if not.

Thanks.

Solution

I hope this helps:

from nltk.corpus import brown
from nltk import UnigramTagger

# Take the first 1000 tagged sentences of the Brown corpus
# (requires the corpus data: nltk.download('brown')).
sents = list(brown.tagged_sents()[:1000])
num_sents = len(sents)
k = 10
foldsize = num_sents // k  # integer division, so it can be used in slices

fold_accuracies = []

for i in range(k):
    start, end = i * foldsize, (i + 1) * foldsize
    # The i-th fold is the test set.
    test = sents[start:end]
    # All remaining sentences form the training set.
    train = sents[:start] + sents[end:]
    # Train a unigram tagger on the training data.
    tagger = UnigramTagger(train)
    # Score the tagger against the gold tags of the test data.
    # (Recent NLTK releases deprecate evaluate() in favour of accuracy().)
    accuracy = tagger.evaluate(test)
    print('Fold', i)
    print('from sent', start, 'to', end)
    print('accuracy =', accuracy)
    print()
    fold_accuracies.append(accuracy)

print('average accuracy =', sum(fold_accuracies) / k)

[out]:

Fold 0
from sent 0 to 100
accuracy = 0.785714285714

Fold 1
from sent 100 to 200
accuracy = 0.745431364216

Fold 2
from sent 200 to 300
accuracy = 0.749628896586

Fold 3
from sent 300 to 400
accuracy = 0.743798291989

Fold 4
from sent 400 to 500
accuracy = 0.803448275862

Fold 5
from sent 500 to 600
accuracy = 0.779836277467

Fold 6
from sent 600 to 700
accuracy = 0.772676371781

Fold 7
from sent 700 to 800
accuracy = 0.755679184052

Fold 8
from sent 800 to 900
accuracy = 0.706402915148

Fold 9
from sent 900 to 1000
accuracy = 0.774622079707

average accuracy = 0.761723794252
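
To make "training" and "testing" concrete, here is a rough sketch of what happens inside a single fold, done by hand without NLTK. The toy corpus and the helper names (train_unigram, tag_sentence, accuracy) are made up for illustration and are not NLTK functions. Tagging unseen words as None is an assumption of this sketch.

```python
from collections import Counter, defaultdict

# Toy tagged corpus (made up for illustration):
# each sentence is a list of (word, tag) pairs.
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
    [("the", "DET"), ("bird", "NOUN"), ("barks", "VERB")],
]

def train_unigram(tagged_sents):
    """Training: count tags per word and keep each word's most frequent tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tag_counts.most_common(1)[0][0]
            for word, tag_counts in counts.items()}

def tag_sentence(model, words, default_tag=None):
    """Tag bare words; words never seen in training get default_tag."""
    return [(w, model.get(w, default_tag)) for w in words]

def accuracy(model, gold_sents):
    """Testing: hide the gold tags, tag the bare words, compare with the gold tags."""
    correct = total = 0
    for sent in gold_sents:
        words = [w for w, _ in sent]          # strip the gold tags
        predicted = tag_sentence(model, words)
        for (_, gold_tag), (_, pred_tag) in zip(sent, predicted):
            correct += gold_tag == pred_tag
            total += 1
    return correct / total

# One fold: the last sentence is held out for testing, the rest is for training.
train, test = corpus[:3], corpus[3:]
model = train_unigram(train)
fold_accuracy = accuracy(model, test)
print("accuracy on held-out fold =", fold_accuracy)
```

In this toy fold, "bird" never occurs in the training sentences, so it is tagged wrongly and the held-out accuracy is 2/3. Note that the test set is not untagged: it keeps its gold tags, which are hidden from the tagger while it tags the words and are then used only for scoring.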