python decision-tree orange

How to stratify data using Orange?


Looking for some help from the Orange experts out there.

I have a data set of about 6 million lines. For simplicity's sake, we'll consider only two columns. One is of positive decimal numbers and is imported as a continuous value. The other is of discrete values (either 0 or 1) where there is a ratio of 30:1 for 1's to 0's.

I am using a classification tree (which I label as 'learner') to get the classifier. I'm then trying to do a cross-validation on my data set while adjusting for the overwhelming 30:1 sample bias. I've tried several variations to do this but continue to get the same result regardless of whether I stratify the data or not.

Below is my code and I've commented out the various lines I've tried (using both True and False values for stratification):

import Orange
import os
import time
import operator

start = time.time()
print "Starting"
print ""

mydata = Orange.data.Table("testData.csv")

# This is used only for the test_with_indices method below
indicesCV = Orange.data.sample.SubsetIndicesCV(mydata)

# I only want the highest level classifier so max_depth=1
learner = Orange.classification.tree.TreeLearner(max_depth=1)

# These are the lines I've tried:
#res = Orange.evaluation.testing.cross_validation([learner], mydata, folds=5, stratified=True)
#res = Orange.evaluation.testing.proportion_test([learner], mydata, 0.8, 100, store_classifiers=1)
res = Orange.evaluation.testing.proportion_test([learner], mydata, learning_proportion=0.8, times=10, stratification=True, store_classifiers=1)
#res = Orange.evaluation.testing.test_with_indices([learner], mydata, indicesCV)

f = open('results.txt', 'a')
divString = "\n##### RESULTS (" + time.strftime("%Y-%m-%d %H:%M:%S") + ") #####"
f.write(divString)
f.write("\nAccuracy:     %.2f" %  Orange.evaluation.scoring.CA(res)[0])
f.write("\nPrecision:    %.2f" % Orange.evaluation.scoring.Precision(res)[0])
f.write("\nRecall:       %.2f" % Orange.evaluation.scoring.Recall(res)[0])
f.write("\nF1:           %.2f\n" % Orange.evaluation.scoring.F1(res)[0])

tree = learner(mydata)

f.write(tree.to_string(leaf_str="%V (%M out of %N)"))
print tree.to_string(leaf_str="%V (%M out of %N)")

end = time.time()
print "Ending"
timeStr = "Execution time: " + str((end - start) / 60) + " minutes"
f.write(timeStr)

f.close()

Note: It may look as though there are syntax errors (stratified vs. stratification), but the program runs as-is without exceptions. Also, I know the documentation shows things like stratified=StratifiedIfPossible, but for some reason only boolean values work for me.


Solution

  • I don't see where you adjust for the 30:1 bias. If you mean to do it by stratification: no, stratification does (in a sense) the opposite of what you want. A stratified sample is one in which the class distribution is roughly the same as in the "population". So with stratified=True, you tell Orange to make sure it keeps the 30:1 bias. If you don't stratify, the sample distribution may randomly be off a bit; with about 6 million rows that deviation is likely to be tiny, which would explain why you see the same results either way.

    You probably wanted to do something along these lines:

    # First, split the table into two tables with rows from different classes:
    
    filt = Orange.data.filter.SameValue()
    filt.position = -1
    filt.value = 0
    class0 = filt(mydata)
    filt.value = 1
    class1 = filt(mydata)
    
    # Now class0 and class1 contain the rows from class 0 and 1, respectively
    # Take 100 rows from each table:
    
    sampler = Orange.data.sample.SubsetIndices2(p0=100)
    ind = sampler(class0)
    samp0 = class0.select(ind, 0)
    ind = sampler(class1)
    samp1 = class1.select(ind, 0)
    
    # samp0 and samp1 now contain 100 rows from each class
    # We have to merge them into a single table
    
    balanced_data = Orange.data.Table(samp0)
    balanced_data.extend(samp1)
    

    After this, balanced_data will have a 1:1 class ratio.

    This may not be exactly what you want, though: a classifier trained on this data will favor the minority class too much, so its performance will be really bad. In my experience, you want to lower the 30:1 ratio, but not by too much.
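
To make the two ideas in this answer concrete, here is a minimal sketch in plain Python (standard library only, not Orange's API). It assumes toy labels with the 30:1 ratio from the question. The helper names stratified_sample and downsample_to_ratio and the target ratio of 5:1 are hypothetical, chosen only for illustration. It shows that a stratified sample keeps the original ratio, while downsampling the majority class lowers it to a chosen value without going all the way to 1:1.

```python
import random
from collections import Counter


def stratified_sample(labels, fraction, rng):
    """Return row indices of a sample that keeps each class's share of the data."""
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    chosen = []
    for indices in by_class.values():
        # Taking the same fraction from every class preserves the class ratio.
        k = int(round(len(indices) * fraction))
        chosen.extend(rng.sample(indices, k))
    return chosen


def downsample_to_ratio(labels, majority, ratio, rng):
    """Keep all minority rows and at most `ratio` majority rows per minority row."""
    minority_idx = [i for i, y in enumerate(labels) if y != majority]
    majority_idx = [i for i, y in enumerate(labels) if y == majority]
    keep = min(len(majority_idx), ratio * len(minority_idx))
    return minority_idx + rng.sample(majority_idx, keep)


rng = random.Random(0)

# Toy stand-in for the class column: 3000 ones and 100 zeros (30:1).
labels = [1] * 3000 + [0] * 100

# Stratified 20% sample: the 30:1 ratio is kept.
strat = stratified_sample(labels, 0.2, rng)
strat_counts = Counter(labels[i] for i in strat)

# Downsample the majority class to an assumed target of 5:1.
reduced = downsample_to_ratio(labels, majority=1, ratio=5, rng=rng)
reduced_counts = Counter(labels[i] for i in reduced)

print("stratified: ", strat_counts[1], "ones,", strat_counts[0], "zeros")
print("downsampled:", reduced_counts[1], "ones,", reduced_counts[0], "zeros")
```

The returned indices could then be used to select rows from the real table. Which target ratio works best is something to determine empirically, for example by evaluating on a held-out set that still has the original class distribution.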