python machine-learning scikit-learn orange

Get classification accuracy on test data using a previously saved model


I am using the Orange data mining tool to write a Python script that computes classification accuracy on test data using a previously saved model (a pickle file).

import pickle
import Orange

dataFile = "training.csv"
data = Orange.data.Table(dataFile)
learner = Orange.classification.RandomForestLearner()
# train the model
cf = learner(data)
# save the pickle file
with open("1.pkcls", "wb") as f:
    pickle.dump(cf, f)

# load the pickle file
with open("1.pkcls", "rb") as f:
    loadCF = pickle.load(f)
testFile = "testing.csv"
test = Orange.data.Table(testFile)

# the already-trained model is passed in place of a learner
learners = [cf]
result = Orange.evaluation.testing.TestOnTestData(data, test, learners)
# get classification accuracy
CAs = Orange.evaluation.CA(result)

I can save and load the model successfully, but the following line raises an error:

    CAs = Orange.evaluation.CA(result)


File "/Users/anaconda2/envs/py36/lib/python3.6/site-packages/Orange/evaluation/scoring.py", line 39, in __new__
    return self(results, **kwargs)
  File "/Users/anaconda2/envs/py36/lib/python3.6/site-packages/Orange/evaluation/scoring.py", line 48, in __call__
    return self.compute_score(results, **kwargs)
  File "/Users/anaconda2/envs/py36/lib/python3.6/site-packages/Orange/evaluation/scoring.py", line 84, in compute_score
    return self.from_predicted(results, skl_metrics.accuracy_score)
  File "/Users/anaconda2/envs/py36/lib/python3.6/site-packages/Orange/evaluation/scoring.py", line 75, in from_predicted
    dtype=np.float64, count=len(results.predicted))
  File "/Users/anaconda2/envs/py36/lib/python3.6/site-packages/Orange/evaluation/scoring.py", line 74, in <genexpr>
    for predicted in results.predicted),
  File "/Users/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/metrics/classification.py", line 172, in accuracy_score
    y_type, y_true, y_pred = _check_targets(y_true, y_pred)
  File "/Users/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/metrics/classification.py", line 82, in _check_targets
    "".format(type_true, type_pred))
ValueError: Can't handle mix of multiclass and continuous

I found a way to avoid this error and compute the classification accuracy: deleting the line

cf = learner(data)

However, if I delete this line, no model is ever trained on the input file, so there is nothing for the following code to save and load:

with open("1.pkcls", "wb") as f:
    pickle.dump(cf, f)

# load the pickle file
with open("1.pkcls", "rb") as f:
    loadCF = pickle.load(f)

Does anyone know whether it is possible to train a model first, save it as a pickle file, and then use it later to test another file and get the classification accuracy?


Solution

  • You must not pre-train the classifier before passing it to TestOnTestData. A more accurate name for it would be TrainOnTrainAndTestOnTestData, because it runs the fitting/training step itself on the learners you pass in.

    Unfortunately, there is no readily available, explicit way to create a Results instance by applying pre-trained classifiers to a test dataset.

    One quick and dirty way is to wrap each pre-trained model in a thunk, so that the 'learners' passed to TestOnTestData simply return the pre-trained models when called:

    results = Orange.evaluation.testing.TestOnTestData(data, test, [lambda testdata: loadCF])
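
If only the classification accuracy is needed, another option is to skip the Results object and compare the loaded model's predictions with the true labels directly. The following is a minimal sketch of that idea, not part of Orange's API: `classification_accuracy`, `toy_model`, `toy_rows`, and `toy_labels` are hypothetical names, and the toy model only stands in for a model loaded with pickle so that the snippet runs without Orange installed.

```python
def classification_accuracy(model, instances, true_labels):
    """Return the fraction of instances whose predicted label equals the true label.

    `model` is any callable that maps a collection of instances to a
    sequence of predicted labels, one per instance.
    """
    predicted = model(instances)
    pairs = list(zip(predicted, true_labels))
    if not pairs:
        raise ValueError("no instances to score")
    return sum(1 for p, t in pairs if p == t) / len(pairs)


# Hypothetical stand-in for a pre-trained model:
# predicts class 1.0 when the first feature exceeds 0.5, else 0.0.
def toy_model(rows):
    return [1.0 if row[0] > 0.5 else 0.0 for row in rows]


toy_rows = [[0.9], [0.2], [0.7], [0.1]]
toy_labels = [1.0, 0.0, 0.0, 0.0]

# Predictions are [1.0, 0.0, 1.0, 0.0], so three of four match.
accuracy = classification_accuracy(toy_model, toy_rows, toy_labels)
print(accuracy)  # 0.75
```

With the objects from the question, the call would presumably be `classification_accuracy(loadCF, test, test.Y)`, assuming that calling the loaded Orange model on a Table returns the predicted class values and that `test.Y` holds the true ones in the same encoding.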