I have a dataframe called `features` and I scale the data as follows:
import pandas as pd
from sklearn.preprocessing import StandardScaler

col_names = features.columns
scaler = StandardScaler()
scaler.fit(features)
standardized_features = scaler.transform(features)
standardized_features.shape
df = pd.DataFrame(data=standardized_features, columns=col_names)
Then I split into training and test sets as follows:
df_idx = df[df.Date == '1996-12-01'].index[0]
df_targets = df['Label'].values
df_features = df.drop(['Regime', 'Date', 'Label'], axis=1)
df_training_features = df.iloc[:df_idx, :].drop(['Regime', 'Date', 'Label'], axis=1)
df_validation_features = df.iloc[df_idx:, :].drop(['Regime', 'Date', 'Label'], axis=1)
df_training_targets = df_targets[:df_idx]
df_validation_targets = df_targets[df_idx:]
In the end, I test different methods:
from matplotlib import pyplot as plt
from sklearn import metrics, model_selection, preprocessing
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
import xgboost as xgb

scoring = 'f1'
kfold = model_selection.TimeSeriesSplit(n_splits=5)
models = []
models.append(('LR', LogisticRegression(C=1e10, class_weight='balanced')))
models.append(('KNN', KNeighborsClassifier()))
models.append(('GB', GradientBoostingClassifier(random_state=42)))
models.append(('ABC', AdaBoostClassifier(random_state=42)))
models.append(('RF', RandomForestClassifier(class_weight='balanced')))
models.append(('XGB', xgb.XGBClassifier(objective='binary:logistic', booster='gbtree')))
results = []
names = []
lb = preprocessing.LabelBinarizer()
for name, model in models:
    # ravel() flattens the (n, 1) LabelBinarizer output into the 1-D array cross_val_score expects
    cv_results = model_selection.cross_val_score(estimator=model, X=df_training_features,
                                                 y=lb.fit_transform(df_training_targets).ravel(),
                                                 cv=kfold, scoring=scoring)
    model.fit(df_training_features, df_training_targets)  # train the model
    # AUC should be computed from predicted probabilities, not hard class predictions
    fpr, tpr, thresholds = metrics.roc_curve(df_training_targets, model.predict_proba(df_training_features)[:, 1])
    auc = metrics.roc_auc_score(df_training_targets, model.predict_proba(df_training_features)[:, 1])
    plt.plot(fpr, tpr, label='%s ROC (area = %0.2f)' % (name, auc))
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
My questions are:
You must fit your StandardScaler on the training data only. Then, with that fitted scaler, you transform both the training data and the validation data. This keeps the same standardization for all input data. Imagine your training data contains the following values of an attribute: [0, 1, 2]. If you apply a simple min-max normalization (the reasoning is the same as for standardization), you obtain [0, 0.5, 1].

Now imagine your validation set also has 3 samples, where that attribute takes the values [0, 1, 100]. If you fit and transform on the validation data, you obtain [0, 0.01, 1]. This is a disaster, because the model was trained to believe that a value of 1 scales to 0.5. That is why you transform your validation data with the statistics learned from the training data.
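The mismatch is easy to reproduce. Here is a minimal sketch using MinMaxScaler and the toy numbers above (min-max normalization, since that is the example; the same logic applies to StandardScaler):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[0.0], [1.0], [2.0]])
validation = np.array([[0.0], [1.0], [100.0]])

scaler = MinMaxScaler()
scaler.fit(train)  # learn min/max from the training data only

# Training data maps to 0, 0.5, 1
print(scaler.transform(train).ravel())
# Validation data reuses the training min/max: 0, 0.5, 50 -- a 1 still maps to 0.5
print(scaler.transform(validation).ravel())

# Refitting on the validation data instead gives every value a different meaning:
leaky = MinMaxScaler().fit_transform(validation)
# Now maps to 0, 0.01, 1 -- the model has never seen 1 encoded as 0.01
print(leaky.ravel())
```

With the shared scaler, a raw value of 1 is encoded identically (0.5) in both sets; with a refitted scaler, the same raw value becomes 0.01 and the model's training no longer applies.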