machine-learning nlp stanford-nlp pos-tagger

What does k-fold cross-validation mean in the context of POS tagging?


I know that for k-fold cross-validation, I'm supposed to divide the corpus into k equal parts. Of these k parts, I use k-1 for training and the remaining one for testing. This process is repeated k times, so that each part is used exactly once for testing.

But I don't understand what exactly training means and what exactly testing means.

What I think is (please correct me if I'm wrong):
1. Training sets (k-1 out of k): These sets are used to build the tag transition probability and emission probability tables. A tagging algorithm (e.g. the Viterbi algorithm) then uses these tables to tag text.
2. Test set (1 set): The remaining set is used to validate the implementation from step 1. That is, this part of the corpus is treated as untagged words, and I should run the step 1 implementation on it.

Is my understanding correct? Please explain if not.

Thanks.


Solution

  • I hope this helps:

    from nltk.corpus import brown
    from nltk import UnigramTagger
    
    # Take the first 1000 tagged sentences of the Brown corpus
    # (the corpus must be installed, e.g. via nltk.download('brown')).
    sents = list(brown.tagged_sents()[:1000])
    num_sents = len(sents)
    k = 10
    # Integer division, so the result can be used as a slice index.
    foldsize = num_sents // k
    
    fold_accuracies = []
    
    for i in range(k):
        start, end = i * foldsize, (i + 1) * foldsize
        # The i-th fold is held out as the test set.
        test = sents[start:end]
        # All sentences outside the test fold are used for training.
        train = sents[:start] + sents[end:]
        # Train a unigram tagger on the training sentences.
        tagger = UnigramTagger(train)
        # Tag the test sentences and compare the result with their gold tags.
        # Newer NLTK releases deprecate evaluate() in favour of accuracy().
        accuracy = tagger.evaluate(test)
        print('Fold', i)
        print('from sent', start, 'to', end)
        print('accuracy =', accuracy)
        print()
        fold_accuracies.append(accuracy)
    
    print('average accuracy =', sum(fold_accuracies) / k)
    

    [out]:

    Fold 0
    from sent 0 to 100
    accuracy = 0.785714285714
    
    Fold 1
    from sent 100 to 200
    accuracy = 0.745431364216
    
    Fold 2
    from sent 200 to 300
    accuracy = 0.749628896586
    
    Fold 3
    from sent 300 to 400
    accuracy = 0.743798291989
    
    Fold 4
    from sent 400 to 500
    accuracy = 0.803448275862
    
    Fold 5
    from sent 500 to 600
    accuracy = 0.779836277467
    
    Fold 6
    from sent 600 to 700
    accuracy = 0.772676371781
    
    Fold 7
    from sent 700 to 800
    accuracy = 0.755679184052
    
    Fold 8
    from sent 800 to 900
    accuracy = 0.706402915148
    
    Fold 9
    from sent 900 to 1000
    accuracy = 0.774622079707
    
    average accuracy = 0.761723794252
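
To connect this back to the two steps in the question: the code above uses a unigram tagger, not an HMM, but the training/testing split works the same way for either. Training means estimating the model (for an HMM, the transition and emission tables) from the k-1 training folds only. Testing means hiding the gold tags of the held-out fold, tagging its words with the trained model, and counting how many predicted tags match the hidden gold tags. Below is a minimal sketch of that loop for an HMM tagger in plain Python. The toy corpus, the helper names (`k_fold_splits`, `train_hmm`, `viterbi`, `accuracy`) and the add-one smoothing are all assumptions made for illustration; none of them are NLTK APIs.

```python
import math
from collections import Counter, defaultdict


def k_fold_splits(sents, k):
    """Yield (train, test) pairs; fold i is the i-th contiguous slice."""
    foldsize = len(sents) // k
    for i in range(k):
        start, end = i * foldsize, (i + 1) * foldsize
        yield sents[:start] + sents[end:], sents[start:end]


def train_hmm(train):
    """Count tag transitions and word emissions in the tagged training data."""
    trans = defaultdict(Counter)  # trans[previous_tag][tag] -> count
    emit = defaultdict(Counter)   # emit[tag][word] -> count
    vocab = set()
    for sent in train:
        prev = "<s>"  # sentence-start pseudo-tag
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    # len(vocab) + 1 leaves one smoothing slot for unseen words.
    return trans, emit, sorted(emit), len(vocab) + 1


def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability (a simplifying assumption)."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))


def viterbi(words, model):
    """Return the most probable tag sequence for words under the model."""
    trans, emit, tags, n_words = model
    if not words:
        return []
    # best[t] = (log score of the best path ending in tag t, that path)
    best = {t: (log_prob(trans["<s>"], t, len(tags))
                + log_prob(emit[t], words[0], n_words), [t]) for t in tags}
    for word in words[1:]:
        new_best = {}
        for t in tags:
            score, path = max(
                ((best[p][0] + log_prob(trans[p], t, len(tags)), best[p][1])
                 for p in tags),
                key=lambda pair: pair[0])
            new_best[t] = (score + log_prob(emit[t], word, n_words), path + [t])
        best = new_best
    return max(best.values(), key=lambda pair: pair[0])[1]


def accuracy(model, test):
    """Fraction of test tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for sent in test:
        words = [w for w, _ in sent]   # the tagger only sees the words
        gold = [t for _, t in sent]    # gold tags are used for scoring only
        pred = viterbi(words, model)
        correct += sum(p == g for p, g in zip(pred, gold))
        total += len(gold)
    return correct / total


# Hypothetical toy corpus, only to make the sketch self-contained.
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
]

scores = [accuracy(train_hmm(train), test)
          for train, test in k_fold_splits(corpus, 3)]
print("per-fold accuracy:", scores)
print("average accuracy =", sum(scores) / len(scores))
```

So the understanding in the question is essentially right, with one refinement: the test fold is not literally untagged. Its tags are kept aside and used as the answer key, which is what makes an accuracy figure per fold possible. The reported number is then the average over the k folds.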