data-science, cross-validation

Train/test good practice with respect to a summary-statistic feature


When one feature of a dataset is a summary statistic computed over the entire pool of data, is it good practice to pool the training data with the test data in order to calculate that feature for validation?

For instance, let's say I have 1000 data points split into 800 entries for training and 200 entries for validation. Using the 800 training entries, I create a feature such as a quartile rank (it could be anything), which labels each entry 0-3 according to the quartile that some other feature falls in. So in the training set, there will be 200 data points in each quartile.

Once the model is trained and I need to calculate the feature again for the validation set, do I a) reuse the quartile boundaries already set on the training data, in which case the 200 validation entries could have something other than a 50-50-50-50 quartile split, or b) recalculate the quartiles using all 1000 entries, giving a new quartile-rank feature with 250 entries in each quartile?

Thanks very much


Solution

  • The ideal practice is to calculate the quartiles on the training dataset and then apply those boundaries to your holdout / validation dataset. For your model diagnostics to evaluate predictive performance correctly, you do not want the distribution of the testing dataset to influence feature construction or model training. This is because that data will not be available in real life when you apply the model to unseen data.

    You may also find this article useful when thinking about train-test splitting: https://towardsdatascience.com/3-things-you-need-to-know-before-you-train-test-split-869dfabb7e50
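
As a rough illustration of option a), here is a minimal sketch using only the Python standard library. The data is synthetic and the names (`values`, `edges`, `quartile_rank`) are hypothetical, not from the question; it assumes a continuous feature with no tied values.

```python
import bisect
import random
import statistics

random.seed(0)
# Hypothetical raw feature: 1000 synthetic values, split 800 / 200.
values = [random.gauss(0, 1) for _ in range(1000)]
train, valid = values[:800], values[800:]

# Learn the three quartile boundaries from the training split only.
edges = statistics.quantiles(train, n=4)

def quartile_rank(x, edges):
    """Return 0-3: the number of training boundaries at or below x."""
    return bisect.bisect_right(edges, x)

# Apply the same fixed boundaries to both splits.
train_q = [quartile_rank(x, edges) for x in train]
valid_q = [quartile_rank(x, edges) for x in valid]

# Training counts are balanced by construction; validation counts need not be.
train_counts = [train_q.count(k) for k in range(4)]
valid_counts = [valid_q.count(k) for k in range(4)]
print(train_counts)
print(valid_counts)
```

The training split lands 200 entries in each quartile, while the validation split is free to deviate from 50-50-50-50. That deviation is expected: it is what the model would see on genuinely unseen data.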