I am building an MLPClassifier model in scikit-learn. I used GridSearchCV with roc_auc to score the model. The mean train and test scores are around 0.76, not bad. The output of cv_results_ is:
Train set AUC: 0.553465272412
Grid best score (AUC): 0.757236688092
Grid best parameter (max. AUC): {'hidden_layer_sizes': 10}
{ 'mean_fit_time': array([63.54, 136.37, 136.32, 119.23, 121.38, 124.03]),
'mean_score_time': array([ 0.04, 0.04, 0.04, 0.05, 0.05, 0.06]),
'mean_test_score': array([ 0.76, 0.74, 0.75, 0.76, 0.76, 0.76]),
'mean_train_score': array([ 0.76, 0.76, 0.76, 0.77, 0.77, 0.77]),
'param_hidden_layer_sizes': masked_array(data = [5 (5, 5) (5, 10) 10 (10, 5) (10, 10)],
mask = [False False False False False False],
fill_value = ?)
,
'params': [ {'hidden_layer_sizes': 5},
{'hidden_layer_sizes': (5, 5)},
{'hidden_layer_sizes': (5, 10)},
{'hidden_layer_sizes': 10},
{'hidden_layer_sizes': (10, 5)},
{'hidden_layer_sizes': (10, 10)}],
'rank_test_score': array([ 2, 6, 5, 1, 4, 3]),
'split0_test_score': array([ 0.76, 0.75, 0.75, 0.76, 0.76, 0.76]),
'split0_train_score': array([ 0.76, 0.75, 0.75, 0.76, 0.76, 0.76]),
'split1_test_score': array([ 0.77, 0.76, 0.76, 0.77, 0.76, 0.76]),
'split1_train_score': array([ 0.76, 0.75, 0.75, 0.76, 0.76, 0.76]),
'split2_test_score': array([ 0.74, 0.72, 0.73, 0.74, 0.74, 0.75]),
'split2_train_score': array([ 0.77, 0.77, 0.77, 0.77, 0.77, 0.77]),
'std_fit_time': array([47.59, 1.29, 1.86, 3.43, 2.49, 9.22]),
'std_score_time': array([ 0.01, 0.01, 0.01, 0.00, 0.00, 0.01]),
'std_test_score': array([ 0.01, 0.01, 0.01, 0.01, 0.01, 0.01]),
'std_train_score': array([ 0.01, 0.01, 0.01, 0.01, 0.01, 0.00])}
As you can see, I use 3-fold cross-validation. Interestingly, the roc_auc_score of the train set computed manually comes out at 0.55, while the mean train score is reported as ~0.76. The code that generates this output is:
import numpy as np
import pprint
import time

from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

def model_mlp(X_train, y_train, verbose=True, random_state=42):
    t = time.time()  # timer for the summary line below
    grid_values = {'hidden_layer_sizes': [(5), (5, 5), (5, 10),
                                          (10), (10, 5), (10, 10)]}
    # MLP requires scaling of all predictors
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    mlp = MLPClassifier(solver='adam', learning_rate_init=1e-4,
                        max_iter=200,
                        verbose=False,
                        random_state=random_state)
    # perform the grid search
    grid_auc = GridSearchCV(mlp,
                            param_grid=grid_values,
                            scoring='roc_auc',
                            verbose=2, n_jobs=-1)
    grid_auc.fit(X_train, y_train)
    y_hat = grid_auc.predict(X_train)
    # print out the results
    if verbose:
        print('Train set AUC: ', roc_auc_score(y_train, y_hat))
        print('Grid best score (AUC): ', grid_auc.best_score_)
        print('Grid best parameter (max. AUC): ', grid_auc.best_params_)
        print('')
        pp = pprint.PrettyPrinter(indent=4)
        pp.pprint(grid_auc.cv_results_)
        print('MLPClassifier fitted, {:.2f} seconds used'.format(time.time() - t))
    return grid_auc.best_estimator_
Because of this difference I decided to 'emulate' the GridSearchCV routine and got the following results:
Shape X_train: (107119, 15)
Shape y_train: (107119,)
Shape X_val: (52761, 15)
Shape y_val: (52761,)
layers roc-auc
Seq l1 l2 train test iters runtime
1 5 0 0.5522 0.5488 85 20.54
2 5 5 0.5542 0.5513 80 27.10
3 5 10 0.5544 0.5521 83 28.56
4 10 0 0.5532 0.5516 61 15.24
5 10 5 0.5540 0.5518 54 19.86
6 10 10 0.5507 0.5474 56 21.09
The scores are all around 0.55, consistent with the manual computation above. What surprised me is the lack of variation in the results. It looks as if I am making some mistake, but I cannot find one; see the code:
def simple_mlp(X, y, verbose=True, random_state=42):
    def do_mlp(X_t, X_v, y_t, y_v, n, l1, l2=None):
        if l2 is None:
            layers = (l1)
            l2 = 0
        else:
            layers = (l1, l2)
        t = time.time()
        mlp = MLPClassifier(solver='adam', learning_rate_init=1e-4,
                            hidden_layer_sizes=layers,
                            max_iter=200,
                            verbose=False,
                            random_state=random_state)
        mlp.fit(X_t, y_t)
        y_hat_train = mlp.predict(X_t)
        y_hat_val = mlp.predict(X_v)
        if verbose:
            av = 'samples'
            acc_trn = roc_auc_score(y_t, y_hat_train, average=av)
            acc_tst = roc_auc_score(y_v, y_hat_val, average=av)
            print("{:5d}{:4d}{:4d}{:7.4f}{:7.4f}{:9d}{:8.2f}"
                  .format(n, l1, l2, acc_trn, acc_tst, mlp.n_iter_, time.time() - t))
        return mlp, n + 1

    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.33,
                                                      random_state=random_state)
    if verbose:
        print('Shape X_train:', X_train.shape)
        print('Shape y_train:', y_train.shape)
        print('Shape X_val:', X_val.shape)
        print('Shape y_val:', y_val.shape)
    # MLP requires scaling of all predictors
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_val = scaler.transform(X_val)
    n = 1
    layers1 = [5, 10]
    layers2 = [5, 10]
    if verbose:
        print("       layers        roc-auc")
        print("  Seq  l1  l2  train validation    iters runtime")
    for l1 in layers1:
        mlp, n = do_mlp(X_train, X_val, y_train, y_val, n, l1)
        for l2 in layers2:
            mlp, n = do_mlp(X_train, X_val, y_train, y_val, n, l1, l2)
    return mlp
I use exactly the same data in both cases (159880 observations and 15 predictors). I use cv=3 (the default) for GridSearchCV and use the same proportion for the validation set in my handcrafted code.
When searching for a possible answer I found this post on SO, which describes the same problem but has no answer. Maybe someone understands what exactly is happening?
Thanks for your time.
Edit
I checked the code of GridSearchCV and KFold as @Mohammed Kashif suggested, and indeed found an explicit remark that KFold does not shuffle the data. So I added the following code to model_mlp before the scaler:
np.random.seed(random_state)
index = np.random.permutation(len(X_train))
X_train = X_train.iloc[index]
and the following into simple_mlp as a replacement for train_test_split:
np.random.seed(random_state)
index = np.random.permutation(len(X))
X = X.iloc[index]
y = y.iloc[index]
train_size = int(2 * len(X) / 3.0)  # two thirds for training
X_train = X[:train_size]
X_val = X[train_size:]
y_train = y[:train_size]
y_val = y[train_size:]
Which resulted in the following output:
Train set AUC: 0.5
Grid best score (AUC): 0.501410198106
Grid best parameter (max. AUC): {'hidden_layer_sizes': (5, 10)}
{ 'mean_fit_time': array([28.62, 46.00, 54.44, 46.74, 55.25, 53.33]),
'mean_score_time': array([ 0.04, 0.05, 0.05, 0.05, 0.05, 0.06]),
'mean_test_score': array([ 0.50, 0.50, 0.50, 0.50, 0.50, 0.50]),
'mean_train_score': array([ 0.50, 0.51, 0.51, 0.51, 0.50, 0.51]),
'param_hidden_layer_sizes': masked_array(data = [5 (5, 5) (5, 10) 10 (10, 5) (10, 10)],
mask = [False False False False False False],
fill_value = ?)
,
'params': [ {'hidden_layer_sizes': 5},
{'hidden_layer_sizes': (5, 5)},
{'hidden_layer_sizes': (5, 10)},
{'hidden_layer_sizes': 10},
{'hidden_layer_sizes': (10, 5)},
{'hidden_layer_sizes': (10, 10)}],
'rank_test_score': array([ 6, 2, 1, 4, 5, 3]),
'split0_test_score': array([ 0.50, 0.50, 0.51, 0.50, 0.50, 0.50]),
'split0_train_score': array([ 0.50, 0.51, 0.50, 0.51, 0.50, 0.51]),
'split1_test_score': array([ 0.50, 0.50, 0.50, 0.50, 0.49, 0.50]),
'split1_train_score': array([ 0.50, 0.50, 0.51, 0.50, 0.51, 0.51]),
'split2_test_score': array([ 0.49, 0.50, 0.49, 0.50, 0.50, 0.50]),
'split2_train_score': array([ 0.51, 0.51, 0.51, 0.51, 0.50, 0.51]),
'std_fit_time': array([19.74, 19.33, 0.55, 0.64, 2.36, 0.65]),
'std_score_time': array([ 0.01, 0.01, 0.00, 0.01, 0.00, 0.01]),
'std_test_score': array([ 0.01, 0.00, 0.01, 0.00, 0.00, 0.00]),
'std_train_score': array([ 0.00, 0.00, 0.00, 0.00, 0.00, 0.00])}
which appears to confirm Mohammed's remarks. I must say I was quite sceptical at first, as I could not imagine such a strong impact of randomization on such a big dataset that does not really look ordered.
I have some doubts, however. In the original setup GridSearchCV came out consistently too high by about 0.20; now it is consistently too low by about 0.05. This is an improvement, as the deviation between the two methods has decreased by a factor of 4. Is there an explanation for this last finding, or is a deviation of about 0.05 between the two methods simply noise? I decided to mark this as the correct answer, but I hope somebody can shed some light on my little doubt.
Answer
The difference in score is mainly due to the different ways in which GridSearchCV and your emulating function split the dataset. Think of it this way: suppose you have 9 data points in your dataset. In GridSearchCV with 3 folds, the distribution might look like this:
train_cv_fold1_indices : 1 2 3 4 5 6
test_cv_fold1_indices : 7 8 9
train_cv_fold2_indices : 1 2 3 7 8 9
test_cv_fold2_indices : 4 5 6
train_cv_fold3_indices : 4 5 6 7 8 9
test_cv_fold3_indices : 1 2 3
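With an actual KFold and shuffle=False (the default) the folds are simply contiguous blocks of indices. Here is a minimal sketch on a hypothetical 9-point array that shows this (only the fold order differs from the illustration above):
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(9).reshape(-1, 1)   # 9 dummy data points
for train_idx, test_idx in KFold(n_splits=3).split(X_toy):   # shuffle=False by default
    print('train:', train_idx, 'test:', test_idx)
# train: [3 4 5 6 7 8] test: [0 1 2]
# train: [0 1 2 6 7 8] test: [3 4 5]
# train: [0 1 2 3 4 5] test: [6 7 8]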
However, your function that emulates GridSearchCV might split the data in a different way, for example:
train_indices : 1 3 5 7 8 9
test_indices : 2 4 6
Now as you can see, this is a different split of the dataset, and hence a classifier trained on it might behave quite differently. (It might even behave the same; it all depends on the data points and on various other factors, such as how representative they are and how well they cover the variation among the data points.)
So, in order to perfectly emulate GridSearchCV, you would need to perform the splits in the same way.
Check the GridSearchCV source and you will find that, at line no 592, in order to perform CV it calls check_cv, specified at this link. check_cv in turn uses either KFold or StratifiedKFold.
So, based on your experiments, I would suggest explicitly performing CV on your dataset using a fixed random seed and the classes mentioned above (either KFold or StratifiedKFold). Then use the same CV object in your emulating function so that the analysis is directly comparable; you should then get much closer values.
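For instance, here is a minimal sketch of that idea. The data comes from make_classification as a stand-in for your own X and y (assumed to be numpy arrays), StratifiedKFold is chosen because the target is a class label, and the tiny parameter grid is only an example:
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neural_network import MLPClassifier

# synthetic stand-in for your data; use your own X, y instead
X, y = make_classification(n_samples=2000, n_features=15, random_state=42)

# one CV splitter with a fixed seed, shared by both approaches
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

# 1) the grid search evaluates on exactly these splits
grid = GridSearchCV(MLPClassifier(max_iter=200, random_state=42),
                    param_grid={'hidden_layer_sizes': [5, 10]},
                    scoring='roc_auc', cv=cv)
grid.fit(X, y)
print('Grid best score (AUC):', grid.best_score_)

# 2) the manual "emulation" iterates over the very same splits
for train_idx, val_idx in cv.split(X, y):
    mlp = MLPClassifier(hidden_layer_sizes=10, max_iter=200, random_state=42)
    mlp.fit(X[train_idx], y[train_idx])
    # score with predicted probabilities; the 'roc_auc' scorer also works on
    # probabilities / decision values rather than on hard class labels
    fold_auc = roc_auc_score(y[val_idx], mlp.predict_proba(X[val_idx])[:, 1])
    print('fold AUC:', fold_auc)
Because both the grid search and the manual loop see exactly the same train/validation indices (and the same random_state), any remaining difference in the scores comes from the model or the scoring, not from how the data was split.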