machine-learning, artificial-intelligence, data-processing

Machine Learning data preprocessing


I have a question about data preprocessing for machine learning, specifically about transforming the data so that it has zero mean and unit variance. I have split my data into two datasets (I know I should have three, but for the sake of simplicity let's just say I have two). Should I transform the training set so that, as a whole, it has zero mean and unit variance, and then at test time transform each test input vector so that that particular vector has zero mean and unit variance? Or should I transform the entire dataset (training and testing) together so that the whole thing has zero mean and unit variance? My belief is that I should do the former, so that I don't introduce a significant amount of bias into the test set. But I am no expert, hence my question.


Solution

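  • Fit your preprocessor on the training set only, then use the mean and variance learned there to transform the test set. Computing these statistics on train and test together leaks some information about the test set into the model. Standardizing each test vector on its own is not the answer either: it applies a different transformation from the one the model saw during training.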

    Let me link you to a good course on deep learning and show you a quote (both from Andrej Karpathy):

    Common pitfall. An important point to make about the preprocessing is that any preprocessing statistics (e.g. the data mean) must only be computed on the training data, and then applied to the validation / test data. E.g. computing the mean and subtracting it from every image across the entire dataset and then splitting the data into train/val/test splits would be a mistake. Instead, the mean must be computed only over the training data and then subtracted equally from all splits (train/val/test).
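
To make the quoted advice concrete, here is a minimal sketch using NumPy. The helper names `fit_standardizer` and `standardize` and the synthetic data are assumptions made for illustration, not part of the original answer. The per-feature mean and standard deviation are computed on the training split only and then reused unchanged on the test split.

```python
import numpy as np


def fit_standardizer(X_train):
    """Compute per-feature mean and standard deviation from the training data only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    # Guard against division by zero for constant features.
    std = np.where(std == 0, 1.0, std)
    return mean, std


def standardize(X, mean, std):
    """Apply previously fitted statistics to any split (train, validation, or test)."""
    return (X - mean) / std


# Synthetic data, purely for illustration: 100 samples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))

# Split first, then fit the statistics on the training portion only.
X_train, X_test = X[:80], X[80:]
mean, std = fit_standardizer(X_train)

X_train_std = standardize(X_train, mean, std)
X_test_std = standardize(X_test, mean, std)  # reuses the training statistics
```

After this, the training split has zero mean and unit variance per feature, while the test split will usually be close to that but not exact, which is expected. If you use scikit-learn, the same pattern is typically written as `StandardScaler().fit(X_train)` followed by `transform` on each split.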