Search code examples
pythonmachine-learningscikit-learn

How to scale train, validation and test sets properly using StandardScaler?


Some articles says that in case of having only train and test sets, first, we need to use fit_transform() to scale training set and then only transform() for test set, in order to prevent data leakage.

In my case, I have also validation set.

I think one of these codes below would be okay to use but I cannot rely on them completely. Any kind of help will be appreciated, thanks!

1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size = 2/7)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)

2)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size = 2/7)
X_test = scaler.transform(X_test)

Solution

  • Generally you would want to use Option 1 code. The reason for using fit and then transform with train data is a) Fit would calculate mean,var etc of train set and then try to fit the model to data b) post which transform is going to convert data as per the fitted model.

    If you use fit again with test set this is going to add bias to your model.