swift · machine-learning · swift-playground · coreml · createml

Evaluation Accuracy is Different When Using Split Table Versus Completely Separate Table (CreateML, Swift)


I am creating a tabular classification model using CreateML and Swift. The data set I am using has about 300 items and about 13 features. I have tried training/testing my model in two ways, with surprisingly different outcomes:

1) Splitting my training and evaluation data table randomly from the original full data set:

let (classifierEvaluationTable, classifierTrainingTable) = classifierTable.randomSplit(by: 0.1, seed: 4)

I have played around a bit with the 0.1 split proportion and the seed value, but the results are all over the place: evaluation accuracy can come out anywhere from 33% to 80% depending on the split. (In this particular run I got 78% training accuracy, 83% validation accuracy, and 75% evaluation accuracy.) A sketch of how I repeat the split with different seeds is included after the second approach below.

2) I manually took 10 items from the original data set and put them into a new data set to test later, then removed them from the 300-item data set used for training. When I tested these 10 items, I got 96% evaluation accuracy. (In this run I got 98% training accuracy, 71% validation accuracy, and 96% evaluation accuracy.)
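For context, here is roughly how I have been repeating the split from the first approach with different seeds; the CSV path and the "label" target column below are placeholders for my actual data:

import Foundation
import CreateML

// Placeholder path and target column name.
let url = URL(fileURLWithPath: "/path/to/data.csv")
let classifierTable = try MLDataTable(contentsOf: url)

// Repeat the 10% hold-out split with a few different seeds and print
// the evaluation accuracy of each run.
for seed in 1...5 {
    let (classifierEvaluationTable, classifierTrainingTable) = classifierTable.randomSplit(by: 0.1, seed: seed)
    let classifier = try MLClassifier(trainingData: classifierTrainingTable, targetColumn: "label")
    let metrics = classifier.evaluation(on: classifierEvaluationTable)
    print("seed \(seed): evaluation accuracy \((1.0 - metrics.classificationError) * 100)%")
}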

I am wondering why there is such a big difference. Which reading should be seen as more realistic and credible? Is there anything I can do to either model to improve accuracy and credibility? I am also confused about what the different accuracy measurements (training, validation, evaluation) mean and how I should interpret them.

Thanks.


Solution

  • The meaning of training and validation accuracy is that if the latter is lower than the former, your model is overfitting, i.e. it is too closely adapted to the training set and cannot generalize properly.

    So your 1st case yielded a good result, and your 2nd a bad one.

    Evaluation accuracy is low when the new (unseen) data you are feeding your model is substantially different in some respect (which can perhaps be solved by preprocessing, or by adding such data to the training set and retraining the model).

    In the 2nd case your model is severely overfitting, and the 10 items were taken from the same original data set, so they are not substantially different from the training data, which naturally gives you a high evaluation accuracy. So it was a rather useless test.

    It is not clear where you got the data for the evaluation accuracy test in the 1st case.

    TL;DR: the 1st case is a good result, the 2nd a bad one. If testing on new data yields a very low evaluation accuracy, that data is probably qualitatively different from the training data.

    Put another way: if validation accuracy is lower than training accuracy, your model is quantitatively bad (overfitting); if evaluation accuracy is low, your model is qualitatively bad, i.e. unsuitable for the data you intend to use it on.
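As a rough illustration of where the three numbers come from, a minimal sketch along the lines of your setup (the CSV path and the "label" target column are placeholders) would be:

import Foundation
import CreateML

// Placeholder path and target column name.
let url = URL(fileURLWithPath: "/path/to/data.csv")
let fullTable = try MLDataTable(contentsOf: url)

// Hold out 10% of the rows for evaluation and train on the remaining 90%.
let (evaluationTable, trainingTable) = fullTable.randomSplit(by: 0.1, seed: 4)
let classifier = try MLClassifier(trainingData: trainingTable, targetColumn: "label")

// Training accuracy: how well the model fits the rows it was trained on.
let trainingAccuracy = (1.0 - classifier.trainingMetrics.classificationError) * 100

// Validation accuracy: measured on rows CreateML automatically sets aside from the
// training data; if it is much lower than training accuracy, the model is overfitting.
let validationAccuracy = (1.0 - classifier.validationMetrics.classificationError) * 100

// Evaluation accuracy: measured on held-out rows the model never saw during training.
let evaluationAccuracy = (1.0 - classifier.evaluation(on: evaluationTable).classificationError) * 100

print(trainingAccuracy, validationAccuracy, evaluationAccuracy)

MLClassifierMetrics reports classificationError, so each accuracy above is just 1 minus that error.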