I know that for k-fold cross-validation, I'm supposed to divide the corpus into k equal parts. Of these k parts, I use k-1 parts for training and the remaining 1 part for testing. This process is repeated k times, so that each part is used exactly once for testing.
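For example, with 100 sentences and k = 5, I picture the split looking like this (just illustrating my own understanding of the partitioning):

n = 100                  # sentences in the corpus
k = 5
fold_size = n // k       # 20 sentences per part
for i in range(k):
    test_start, test_end = i * fold_size, (i + 1) * fold_size
    # Fold i: test on sentences [test_start, test_end), train on all the rest.
    print('fold %d: test = sentences %d..%d, train = the other %d'
          % (i, test_start, test_end - 1, n - fold_size))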
But I don't understand what exactly training means and what exactly testing means.
What I think is (please correct me if I'm wrong):
1. Training sets (k-1 out of k): These sets are used to build the tag transition probability and emission probability tables. Then some tagging algorithm (e.g. the Viterbi algorithm) is applied using these probability tables.
2. Test set (the remaining 1 part): Use this set to validate the implementation from step 1. That is, the words in this part of the corpus are treated as untagged, the step 1 tagger is run on them, and its output is compared against the gold tags (roughly as in the sketch after this list).
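To make point 2 concrete, here is the kind of thing I have in mind for a single fold. I'm using NLTK's HiddenMarkovModelTrainer purely as my own example of a tagger built from transition and emission probabilities; whether this is the right overall procedure is what I'm asking about:

from nltk.corpus import brown
from nltk.tag.hmm import HiddenMarkovModelTrainer

sents = brown.tagged_sents()[:1000]
train_sents = sents[:900]    # stands in for the k-1 training parts
test_sents = sents[900:]     # stands in for the 1 held-out test part

# "Training": estimate the tag transition and emission probability tables
# from the tagged training sentences (no smoothing here, so unseen words
# simply get probability zero -- fine for a sketch, not for real use).
tagger = HiddenMarkovModelTrainer().train_supervised(train_sents)

# "Testing": strip the gold tags, re-tag with Viterbi decoding, and count
# how many of the predicted tags match the gold tags.
correct, total = 0, 0
for gold_sent in test_sents:
    words = [w for w, t in gold_sent]
    predicted = tagger.tag(words)          # Viterbi decoding happens here
    for (w, gold_tag), (w2, pred_tag) in zip(gold_sent, predicted):
        total += 1
        if gold_tag == pred_tag:
            correct += 1
print('accuracy = %f' % (correct / float(total)))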
Is my understanding correct? Please explain if not.
Thanks.
I hope this helps:
from nltk.corpus import brown
from nltk import UnigramTagger as ut

# Let's just take the first 1000 tagged sentences.
sents = brown.tagged_sents()[:1000]
num_sents = len(sents)
k = 10
foldsize = num_sents // k
fold_accuracies = []

for i in range(k):
    # Locate the test set for this fold.
    test = sents[i * foldsize:(i + 1) * foldsize]
    # Use all the sentences not in the test set for training.
    train = sents[:i * foldsize] + sents[(i + 1) * foldsize:]
    # Train a unigram tagger on the training data.
    tagger = ut(train)
    # Evaluate its accuracy on the held-out test data.
    accuracy = tagger.evaluate(test)
    print 'Fold', i
    print 'from sent', i * foldsize, 'to', (i + 1) * foldsize
    print 'accuracy =', accuracy
    print
    fold_accuracies.append(accuracy)

print 'average accuracy =', sum(fold_accuracies) / k
[out]:
Fold 0
from sent 0 to 100
accuracy = 0.785714285714
Fold 1
from sent 100 to 200
accuracy = 0.745431364216
Fold 2
from sent 200 to 300
accuracy = 0.749628896586
Fold 3
from sent 300 to 400
accuracy = 0.743798291989
Fold 4
from sent 400 to 500
accuracy = 0.803448275862
Fold 5
from sent 500 to 600
accuracy = 0.779836277467
Fold 6
from sent 600 to 700
accuracy = 0.772676371781
Fold 7
from sent 700 to 800
accuracy = 0.755679184052
Fold 8
from sent 800 to 900
accuracy = 0.706402915148
Fold 9
from sent 900 to 1000
accuracy = 0.774622079707
average accuracy = 0.761723794252
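Since your question is about transition/emission probabilities and Viterbi, the same loop works unchanged if you swap the unigram tagger for NLTK's HMM tagger. This is only a sketch on my side (HiddenMarkovModelTrainer with its default unsmoothed estimator, so expect lower accuracy on unseen words), but it shows that the cross-validation scaffolding is independent of which tagger you train:

from nltk.corpus import brown
from nltk.tag.hmm import HiddenMarkovModelTrainer

sents = brown.tagged_sents()[:1000]
k = 10
foldsize = len(sents) // k
accuracies = []
for i in range(k):
    test = sents[i * foldsize:(i + 1) * foldsize]
    train = sents[:i * foldsize] + sents[(i + 1) * foldsize:]
    # "Training" = estimating the transition and emission probabilities
    # from the tagged training sentences.
    tagger = HiddenMarkovModelTrainer().train_supervised(train)
    # "Testing" = tagging the held-out sentences (Viterbi decoding inside
    # tagger.tag/evaluate) and comparing against the gold tags.
    accuracies.append(tagger.evaluate(test))
print('average accuracy = %f' % (sum(accuracies) / k))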