python machine-learning scikit-learn machine-learning-model

Performance of ML model after StandardScaler transform on TEST data


Overview: I'm new to ML and am learning sklearn preprocessing. I noticed that the mean will not be 0 and the std will not be 1 when we call the sklearn preprocessing transform on TEST data, because the TRAIN data's mean and std are used to standardize the test data.

My question: If the test data is standardized in this way (not correctly standardized to a standard normal distribution with mean 0 and std 1), will this affect the predictions of the ML algorithm? My understanding is that the predictions will have low accuracy, because we are giving the model incorrectly standardized data.

Code screenshot for mean and std


Solution

  • If the test mean and std end up far from 0 and 1, this is telling you that your training and test sets might have different distributions (small deviations are just sampling noise). If your training set is not representative of the global population (here represented by the TEST data), then the model won't generalise well.

    It's completely OK if your test data isn't centred around zero with a std of 1. The point of this transform is to put all features on a comparable scale, because otherwise a number of algorithms would update the model incorrectly (with respect to the user's intention), letting features with larger numeric ranges dominate. By applying this transform you are saying "all features are equally important".

    There's no such thing as "incorrectly standardized data" (in the way you described), only training data that is not representative.
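
To see the effect described in the question, here is a minimal sketch. The synthetic two-feature data, the 80/20 split, and the random seed are assumptions made for illustration and are not taken from the original post. It fits a `StandardScaler` on the training split only and then reuses those statistics on the test split:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): two features on very different scales.
X = np.column_stack([
    rng.normal(loc=50.0, scale=10.0, size=1000),
    rng.normal(loc=0.5, scale=0.1, size=1000),
])

X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean/std from TRAIN only
X_test_scaled = scaler.transform(X_test)        # reuses the TRAIN mean/std

# Train statistics are 0 and 1 up to floating-point error.
print("train mean:", X_train_scaled.mean(axis=0))
print("train std: ", X_train_scaled.std(axis=0))

# Test statistics are close to, but not exactly, 0 and 1.
print("test mean: ", X_test_scaled.mean(axis=0))
print("test std:  ", X_test_scaled.std(axis=0))
```

Because both splits here are drawn from the same distribution, the test statistics should land near 0 and 1 without matching them exactly. A large gap would suggest the two splits are distributed differently, which is the situation the answer warns about.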