What is the difference between fit_transform and transform? Why doesn't transform work directly?
from sklearn.preprocessing import StandardScaler
X_scaler = StandardScaler()
X_train = X_scaler.fit_transform(X_train)
X_test = X_scaler.transform(X_test)
If I call transform directly (without fitting first), it gives the error below:
NotFittedError: This StandardScaler instance is not fitted yet. Call 'fit' with appropriate arguments before using this method.
StandardScaler, as per the documentation:

Standardize features by removing the mean and scaling to unit variance

So it first needs to know the mean and variance of your data. That is why fit() or fit_transform() is needed: it lets StandardScaler go through all of your data and compute the mean and variance. Those are then available as attributes:
mean_ : The mean value for each feature (column) in the training set.
var_ : The variance for each feature in the training set.
Note that those will be calculated separately for each column in the data.
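For example, a minimal sketch (using a made-up toy array, not data from the question) shows what fit() stores:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy training data: 3 samples, 2 features (columns)
X_train = np.array([[1.0, 10.0],
                    [2.0, 20.0],
                    [3.0, 30.0]])

scaler = StandardScaler()
scaler.fit(X_train)   # learns the statistics, does not change X_train

print(scaler.mean_)   # [2. 20.] -> one mean per column
print(scaler.var_)    # approx. [0.667 66.667] -> one variance per column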
In transform(), it just uses those saved mean and variance values to scale the data.
Now you might ask why transform() doesn't simply compute those attributes itself. The reason is that the test data must be scaled in exactly the same way as the training data was (via fit_transform()). If the mean and variance were recalculated in every call to transform(), each batch of data would end up on a different scale, which is not what you want.
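A quick way to convince yourself (again with made-up arrays) is that transform() on test data reproduces (X_test - mean_) / sqrt(var_) using the training statistics:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data for illustration
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[10.0], [20.0]])

scaler = StandardScaler().fit(X_train)

# transform() scales the test data with the mean and variance learned on X_train
scaled = scaler.transform(X_test)
manual = (X_test - scaler.mean_) / np.sqrt(scaler.var_)

print(np.allclose(scaled, manual))   # True: the training statistics are reused as-is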
This is true for all scikit-learn transformers:
1) fit() - only goes through the data and saves all the attributes it needs (here, the per-column mean and variance).
2) transform() - uses the attributes saved by fit() to change the data.
3) fit_transform() - utility method that runs fit() and then transform() on the same data (see the equivalence check below).
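Under that description, fit_transform() should give the same result as fit() followed by transform() on the same data; a small check (with a made-up array) confirms it:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]])   # hypothetical data

a = StandardScaler().fit_transform(X)        # fit and transform in one call
b = StandardScaler().fit(X).transform(X)     # fit, then transform, on the same data

print(np.allclose(a, b))   # True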
Usually you would call fit_transform() on the training data, and only transform() on the test data.
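The same fit()/transform() contract carries over to transformers you write yourself, which is why it holds across scikit-learn. A rough sketch of a custom transformer (the class name and centering logic are purely illustrative, not part of the original answer):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class MeanCenterer(BaseEstimator, TransformerMixin):
    """Illustrative transformer: subtracts the per-column mean learned in fit()."""

    def fit(self, X, y=None):
        # Only compute and store what transform() will need later
        self.mean_ = np.asarray(X).mean(axis=0)
        return self

    def transform(self, X):
        # Reuse the stored training statistics; never recompute them here
        return np.asarray(X) - self.mean_

# TransformerMixin provides fit_transform() for free: fit() followed by transform()
X_train = np.array([[1.0, 10.0], [3.0, 30.0]])
X_test = np.array([[2.0, 20.0]])

centerer = MeanCenterer()
print(centerer.fit_transform(X_train))   # centered with the training means
print(centerer.transform(X_test))        # test data centered with those same means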