What is the difference between fit_transform and transform? Why doesn't transform work directly?
from sklearn.preprocessing import StandardScaler
X_scaler = StandardScaler()
X_train = X_scaler.fit_transform(X_train)
X_test = X_scaler.transform(X_test)
If I call transform directly (without fitting first), it gives the error below:
NotFittedError: This StandardScaler instance is not fitted yet. Call 'fit' with appropriate arguments before using this method.
StandardScaler, as per the documentation:

Standardize features by removing the mean and scaling to unit variance

So it first needs to know the mean and variance of your data. That is why fit() or fit_transform() is needed: it lets StandardScaler go through all of your data and compute the mean and variance. Those are then available as attributes:
mean_ : The mean value for each feature (column) in the training set.
var_ : The variance for each feature in the training set.
Note that those will be calculated separately for each column in the data.
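For example, a minimal sketch (using a made-up toy array, not data from the question) shows what fit() stores:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy training data: 3 samples, 2 features (columns)
X_train = np.array([[1.0, 10.0],
                    [2.0, 20.0],
                    [3.0, 30.0]])

scaler = StandardScaler()
scaler.fit(X_train)   # learns the statistics, does not change X_train

print(scaler.mean_)   # [2. 20.] -> one mean per column
print(scaler.var_)    # approx. [0.667 66.667] -> one variance per column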
In transform(), it just uses those saved mean and variance values to scale the data.
Now you might ask why transform() doesn't simply compute those attributes itself. The reason is that the test data must be scaled in exactly the same way as the training data was (via fit_transform()). If the mean and variance were recalculated in every call to transform(), each batch of data would end up on a different scale, which is not what you want.
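A quick way to convince yourself (again with made-up arrays) is that transform() on test data reproduces (X_test - mean_) / sqrt(var_) using the training statistics:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data for illustration
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[10.0], [20.0]])

scaler = StandardScaler().fit(X_train)

# transform() scales the test data with the mean and variance learned on X_train
scaled = scaler.transform(X_test)
manual = (X_test - scaler.mean_) / np.sqrt(scaler.var_)

print(np.allclose(scaled, manual))   # True: the training statistics are reused as-is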
This is true for all scikit-learn transformers:
1) fit() - only goes through the data and saves all the attributes it needs (here, the per-column mean and variance).
2) transform() - uses the attributes saved by fit() to change the data.
3) fit_transform() - utility method that runs fit() and then transform() on the same data (see the equivalence check below).
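Under that description, fit_transform() should give the same result as fit() followed by transform() on the same data; a small check (with a made-up array) confirms it:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]])   # hypothetical data

a = StandardScaler().fit_transform(X)        # fit and transform in one call
b = StandardScaler().fit(X).transform(X)     # fit, then transform, on the same data

print(np.allclose(a, b))   # True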
Usually you would call fit_transform() on the training data, and only transform() on the test data.
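The same fit()/transform() contract carries over to transformers you write yourself, which is why it holds across scikit-learn. A rough sketch of a custom transformer (the class name and centering logic are purely illustrative, not part of the original answer):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class MeanCenterer(BaseEstimator, TransformerMixin):
    """Illustrative transformer: subtracts the per-column mean learned in fit()."""

    def fit(self, X, y=None):
        # Only compute and store what transform() will need later
        self.mean_ = np.asarray(X).mean(axis=0)
        return self

    def transform(self, X):
        # Reuse the stored training statistics; never recompute them here
        return np.asarray(X) - self.mean_

# TransformerMixin provides fit_transform() for free: fit() followed by transform()
X_train = np.array([[1.0, 10.0], [3.0, 30.0]])
X_test = np.array([[2.0, 20.0]])

centerer = MeanCenterer()
print(centerer.fit_transform(X_train))   # centered with the training means
print(centerer.transform(X_test))        # test data centered with those same means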