machine-learning, train-test-split

Train/Test Datasets in Machine Learning


I just have a general question:

In a previous job, I was tasked with building a series of non-linear models to quantify the impact of certain factors on the number of medical claims filed. We had a set of variables we would use in all models (e.g., state, year, sex). We used all of our data to build these models; that is, we never split the data into training and test sets.

If I were to go back in time to this job and split the data into training and test sets, what would the advantages of that approach be, besides assessing the prediction accuracy of our models? And what is an argument for not splitting the data before fitting the model? I never really thought about it until now, and I'm curious as to why we didn't take that approach.

Thanks!


Solution

  • Strictly speaking, the sole purpose of setting aside a test set is to assess prediction accuracy. But there is more to it than just checking the number and thinking, "huh, so that's how my model performs"!

    Knowing how your model performs at a given moment gives you an important benchmark for potential improvements. Without a test set, how would you know whether adding a feature actually improves performance? And how would you know whether your model is any better than mere random guessing? Sometimes extremely simple models outperform the more complex ones, so it pays to compare against a trivial baseline (see the first sketch below).

    Another use is deciding which features or observations to remove. This depends somewhat on the kind of model you use, but some models (e.g., k-nearest neighbors) perform significantly better if you remove unimportant features from the data (see the second sketch below). Similarly, suppose you add more training data and your model's test performance suddenly drops significantly; perhaps something is wrong with the new observations? A test set is what makes you aware of these things.

    The only argument I can think of for not using a test set is that you would otherwise have too little training data for the model to perform optimally, and even then, cross-validation is usually a better answer than skipping evaluation altogether.
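
Here is a minimal sketch of the benchmarking idea with scikit-learn (assumed; the question names no library), using a synthetic regression dataset as a stand-in for the claims data and a random forest as a stand-in non-linear model:

```python
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the claims data (the real features would be
# state, year, sex, etc.).
X, y = make_regression(n_samples=2_000, n_features=10, noise=10.0, random_state=0)

# Hold out 25% of the rows as a test set; the models never see them.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Trivial baseline: always predict the mean of the training target.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)

# Candidate non-linear model, fit only on the training split.
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

print("baseline MAE:", mean_absolute_error(y_test, baseline.predict(X_test)))
print("model MAE:   ", mean_absolute_error(y_test, model.predict(X_test)))
```

Unless the model beats the dummy baseline on the held-out data, there is no evidence it has learned anything; without the split, you could not even run this comparison.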
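And a sketch of the kNN point: with synthetic data where only 5 of 20 features carry signal, dropping the 15 noise features typically improves test accuracy, because kNN's distance computation is easily drowned out by irrelevant dimensions. (Again, the dataset and parameters are illustrative assumptions.)

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# 5 informative features followed by 15 pure-noise features
# (shuffle=False keeps the informative columns first).
X, y = make_classification(
    n_samples=2_000, n_features=20, n_informative=5,
    n_redundant=0, shuffle=False, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit kNN once on all features, once on the informative subset only.
for name, cols in [("all 20 features", slice(None)),
                   ("5 informative only", slice(0, 5))]:
    knn = KNeighborsClassifier().fit(X_train[:, cols], y_train)
    print(f"{name}: test accuracy = {knn.score(X_test[:, cols], y_test):.3f}")
```

The test set is what lets you see the accuracy gap between the two feature sets; the training data alone would not reveal it.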