Tags: python-3.x, scikit-learn, cross-validation

fit or fit_transform if I used StandardScaler on the entire dataset?


I have a dataframe called features, and I scale the data as follows:


import pandas as pd
from sklearn.preprocessing import StandardScaler

col_names = features.columns

# Scale the entire dataset before splitting
scaler = StandardScaler()
scaler.fit(features)
standardized_features = scaler.transform(features)
standardized_features.shape  # sanity-check the shape
df = pd.DataFrame(data=standardized_features, columns=col_names)

Then I split the training and test sets as follows:

# Everything before '1996-12-01' is training data, the rest is validation
df_idx = df[df.Date == '1996-12-01'].index[0]

drop_cols = ['Regime', 'Date', 'Label']
df_training_features = df.iloc[:df_idx, :].drop(drop_cols, axis=1)
df_validation_features = df.iloc[df_idx:, :].drop(drop_cols, axis=1)

df_targets = df['Label'].values
df_training_targets = df_targets[:df_idx]
df_validation_targets = df_targets[df_idx:]

In the end, I test different methods:

import matplotlib.pyplot as plt
import xgboost as xgb
from sklearn import metrics, model_selection, preprocessing
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

scoring = 'f1'
kfold = model_selection.TimeSeriesSplit(n_splits=5)
models = []

models.append(('LR', LogisticRegression(C=1e10, class_weight='balanced')))
models.append(('KNN', KNeighborsClassifier()))
models.append(('GB', GradientBoostingClassifier(random_state=42)))
models.append(('ABC', AdaBoostClassifier(random_state=42)))
models.append(('RF', RandomForestClassifier(class_weight='balanced')))
models.append(('XGB', xgb.XGBClassifier(objective='binary:logistic', booster='gbtree')))

results = []
names = []
lb = preprocessing.LabelBinarizer()

for name, model in models:
    # LabelBinarizer returns a column vector; ravel() flattens it to the
    # 1-D array that cross_val_score expects
    cv_results = model_selection.cross_val_score(estimator=model, X=df_training_features,
                                                 y=lb.fit_transform(df_training_targets).ravel(),
                                                 cv=kfold, scoring=scoring)

    model.fit(df_training_features, df_training_targets)  # train the model

    # Use predicted probabilities for both the ROC curve and the AUC, so the
    # reported area matches the plotted curve
    train_probas = model.predict_proba(df_training_features)[:, 1]
    fpr, tpr, thresholds = metrics.roc_curve(df_training_targets, train_probas)
    auc = metrics.roc_auc_score(df_training_targets, train_probas)
    plt.plot(fpr, tpr, label='%s ROC (area = %0.2f)' % (name, auc))
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

plt.legend()
plt.show()

My questions are:

  • If I initially scaled my data with StandardScaler, is it correct that in the last part I use fit_transform rather than fit when building the y argument of model_selection.cross_val_score? Why?
  • For the prediction, should I simply use model.predict(df_validation_features)?

Solution

  • You must fit your StandardScaler on the training data only. Then, with that fitted scaler, you transform both the training data and the validation data (see the sketch after this explanation).

    This is done to keep the same standardization for all input data. Imagine your training data contains the following values of an attribute: [0, 1, 2]. If you apply a simple min-max normalization (the argument is the same for standardization), you obtain [0, 0.5, 1].

    Now imagine your validation set also has 3 samples, and the same attribute takes the values [0, 1, 100]. If you fit and transform on the validation data, you get [0, 0.01, 1]. This is a disaster, because the model was trained believing that the value 1 scales to 0.5, while at prediction time that same value 1 arrives scaled to 0.01. That is why you transform your validation data with the statistics learned from the training data.
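    A minimal sketch of this pattern, reusing the names from the question and assuming features holds only the numeric feature columns and df_idx marks the first validation row:

    # Split first, then scale: fit the scaler on the training rows only
    train_raw = features.iloc[:df_idx, :]
    validation_raw = features.iloc[df_idx:, :]

    scaler = StandardScaler()
    standardized_train = scaler.fit_transform(train_raw)        # learn mean/std from training data
    standardized_validation = scaler.transform(validation_raw)  # reuse the training statistics

    The same reasoning applies inside cross-validation: each fold's validation split should not influence the scaling either. One way to get this for free is to wrap the scaler and the model in a scikit-learn Pipeline, which refits the scaler on each fold's training split:

    from sklearn.pipeline import make_pipeline

    # The pipeline standardizes each fold's training split independently,
    # so no validation fold leaks into the scaling
    pipe = make_pipeline(StandardScaler(), LogisticRegression(C=1e10, class_weight='balanced'))
    cv_results = model_selection.cross_val_score(pipe, df_training_features,
                                                 df_training_targets, cv=kfold, scoring=scoring)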