How would I use the scikit-learn MinMaxScaler to standardize every column in a pandas DataFrame training set, but apply the exact same standard deviation and min/max formula to my test set?
Since my test data is unknown to the model, I don't want to standardize the whole data set; that would not give an accurate model for future unknown data. Instead, I would like to scale the data to between 0 and 1 using the training set, and use the same std, min, and max values in the formula for the test data.
(Obviously I can write my own min-max scaler, but I'm wondering whether scikit-learn can already do this, or whether there is a library I can use for it first.)
You should be able to fit it on your training data and then transform your test data:
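from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(X_train)  # learn the per-column min and max from the training data only
X_train_scaled = scaler.transform(X_train)  # or: fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuses the training min and max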
Your approach seems like good practice. If you were to call fit on your entire X matrix (train and test combined), you'd be causing information leakage, as your training data would have "seen" the scale of your test data beforehand. Using a class-based implementation, MinMaxScaler(), is how sklearn addresses this specifically: it allows the object to "remember" attributes of the data on which it was fit.
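Since the question is about a pandas DataFrame, here is a minimal sketch of the same fit-on-train, transform-on-test pattern with labelled columns. The frames train_df and test_df, their column names, and their values are made up for illustration. It assumes the default scikit-learn output setting, where transform returns a NumPy array that has to be wrapped back into a DataFrame.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy frames; column names and values are illustrative only.
train_df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 5.0], "b": [10.0, 20.0, 30.0, 50.0]})
test_df = pd.DataFrame({"a": [0.0, 4.0, 6.0], "b": [10.0, 40.0, 60.0]})

scaler = MinMaxScaler()
scaler.fit(train_df)  # learns per-column min and max from the training rows only

# Rebuild DataFrames so the column labels and index survive the transform.
train_scaled = pd.DataFrame(
    scaler.transform(train_df), columns=train_df.columns, index=train_df.index
)
test_scaled = pd.DataFrame(
    scaler.transform(test_df), columns=test_df.columns, index=test_df.index
)

print(scaler.data_min_, scaler.data_max_)  # statistics taken from train_df
print(test_scaled)
```

Note that the scaled test values are not guaranteed to stay inside [0, 1]: any test value below the training minimum or above the training maximum maps outside that range, which is expected when the test data is treated as unseen.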
However, be aware that MinMaxScaler() does not scale to ~N(0, 1). In fact, it is explicitly billed as an alternative to that scaling. In other words, it does not guarantee you unit variance or zero mean at all, and it doesn't use standard deviation as it's defined in the traditional sense.
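To make the difference concrete, here is a small sketch comparing MinMaxScaler with StandardScaler, which is scikit-learn's zero-mean, unit-variance scaler. The data, the seed, and the name X_demo are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)  # arbitrary seed, illustrative data only
X_demo = rng.normal(loc=5, scale=2, size=(200, 3))

minmax = MinMaxScaler().fit_transform(X_demo)
standard = StandardScaler().fit_transform(X_demo)

# Min-max output is pinned to [0, 1] per column; its mean and std are
# whatever falls out of the data, not 0 and 1.
print(minmax.min(axis=0), minmax.max(axis=0))
print(minmax.mean(axis=0), minmax.std(axis=0))

# StandardScaler output has mean ~0 and (population) std ~1 per column.
print(standard.mean(axis=0), standard.std(axis=0))
```

If zero mean and unit variance are what you actually want, StandardScaler follows the same fit-on-train, transform-on-test pattern.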
From the docstring:
The transformation is given by:
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min
where min and max are equal to your unpacked feature_range (default (0, 1)) from the __init__ of MinMaxScaler(). Manually this is:
def scale(a):
    # implicit feature_range=(0, 1)
    return (a - X_train.min(axis=0)) / (X_train.max(axis=0) - X_train.min(axis=0))
So say you had:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
np.random.seed(444)
X = np.random.normal(loc=5, scale=2, size=(200, 3))
y = np.random.normal(loc=-5, scale=3, size=X.shape[0])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=444)
If you were to call
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
then know that scaler.scale_ is not the standard deviation of the data on which you did the fitting:
scaler.scale_
# array([ 0.0843, 0.0852, 0.0876])
X_train.std(axis=0)
# array([ 2.042 , 2.0767, 2.1285])
Instead, it is:
(1 - 0) / (X_train.max(axis=0) - X_train.min(axis=0))
# array([ 0.0843, 0.0852, 0.0876])
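As a sanity check, the following sketch rebuilds the same kind of data and confirms that transforming the test set with the fitted scaler matches the manual formula using training statistics only. The variable names train_min, train_range, manual_test, and via_attrs are illustrative. It relies on the fitted attributes scale_ and min_, where min_ is the additive offset of the affine map and is not the same thing as the training minimum (that one is data_min_).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

np.random.seed(444)
X = np.random.normal(loc=5, scale=2, size=(200, 3))
X_train, X_test = train_test_split(X, random_state=444)

scaler = MinMaxScaler().fit(X_train)

# Manual version of the formula, using training statistics only.
train_min = X_train.min(axis=0)
train_range = X_train.max(axis=0) - train_min
manual_test = (X_test - train_min) / train_range

# The fitted scaler applies the same affine map: X * scale_ + min_.
via_attrs = X_test * scaler.scale_ + scaler.min_

print(np.allclose(scaler.transform(X_test), manual_test))
print(np.allclose(via_attrs, manual_test))
print(np.allclose(scaler.data_min_, train_min))
```

Each check should print True, which shows that nothing from X_test influences the scaling parameters.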