Tags: python, machine-learning, callback, keras, checkpoint

Why is the evaluation of a model saved by ModelCheckpoint different from the results in the training history?


My code is the following:

from keras.models import Sequential
from keras.layers import Dense, Flatten, Dropout
from keras.utils import np_utils
from keras.callbacks import ModelCheckpoint
import numpy as np

best_weights_filepath = './best_weights.hdf5'

labels = np.array([1, 2]) # 0 - num_classes - 1
y_train = np_utils.to_categorical(labels, 3)
X_train = np.array([[[1, 2], [3, 4]], [[1, 2], [3, 4]]])

model = Sequential()
model.add(Flatten(input_shape=X_train.shape[1:]))
model.add(Dropout(0.2))
model.add(Dense(64))
model.add(Dropout(0.15))
model.add(Dense(32))
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='sgd')
mcp = ModelCheckpoint(best_weights_filepath, monitor="loss",
                      save_best_only=True)
hist = model.fit(X_train, y_train, 32, 50, callbacks=[mcp])  # batch_size=32, nb_epoch=50
print(hist.history)
model.load_weights(best_weights_filepath)
evaluation = model.evaluate(X_train, y_train)
print(evaluation)

History:

{'loss': [0.88553774356842041, 1.3095510005950928, 1.0029082298278809, 0.93805015087127686, 0.91467124223709106, 1.2132010459899902, 1.0659240484237671, 0.70151412487030029, 1.1300414800643921, 0.94646221399307251, 0.85309064388275146, 0.79526293277740479, 0.70288115739822388, 1.1289818286895752, 0.87788408994674683, 0.63794469833374023, 0.92958927154541016, 0.63434022665023804, 0.26608449220657349, 1.133800745010376, 0.45052343606948853, 0.29425695538520813, 1.3438365459442139, 1.6920032501220703, 1.1263372898101807, 0.78767621517181396, 1.8708134889602661, 0.39164793491363525, 1.9281209707260132, 0.56522297859191895, 0.97685378789901733, 0.73725700378417969, 0.55782550573348999, 1.0230169296264648, 0.63401424884796143, 0.27007108926773071, 1.3010811805725098, 0.58272790908813477, 0.62068361043930054, 0.85791635513305664, 1.2364600896835327, 0.55607849359512329, 1.382312536239624, 1.0019338130950928, 0.24319441616535187, 0.76683026552200317, 0.99913954734802246, 0.57584917545318604, 0.78851628303527832, 1.8757588863372803]}

Evaluation of saved model:

0.698137879372

I wonder why the evaluation of the saved best model differs from the best loss in the history.

Additional info:

I tried to save information about the epoch number and loss with this code:

mcp = ModelCheckpoint(filepath='./{epoch:d}_{loss:.5f}.hdf5', monitor="loss",
                      save_best_only=True)

And got the following files:

0_1.71130.hdf5 2_0.39069.hdf5 17_0.25475.hdf5 20_0.15824.hdf5

These correspond to the training output:

Epoch 21/50
2/2 [==============================] - 0s - loss: 0.1582

But after loading the best model:

best_weights_filepath = "20_0.15824.hdf5"
model.load_weights(best_weights_filepath)
evaluation = model.evaluate(X_train, y_train)
print(evaluation)

Result:

0.792584061623

Update based on suggestions from Josef Korbel:

Check with shuffle=False. I changed this line of code:

hist = model.fit(X_train, y_train, 32, 50, callbacks=[mcp], shuffle=False)

History:

{'loss': [1.0125206708908081, 0.1452154815196991, 0.51181155443191528, 0.56420713663101196, 0.84724342823028564, 1.1929426193237305, 0.29997271299362183, 0.75090807676315308, 0.85906744003295898, 1.2877860069274902, 1.8168995380401611, 0.25087261199951172, 0.67293435335159302, 0.036234244704246521, 1.5076791048049927, 0.87120181322097778, 0.68330782651901245, 2.0751430988311768, 0.82240021228790283, 0.60692423582077026, 0.37373599410057068, 0.3232136070728302, 0.80889785289764404, 0.096551664173603058, 0.37592190504074097, 0.72723108530044556, 0.21966041624546051, 1.0940688848495483, 0.68471181392669678, 0.68382972478866577, 0.5214000940322876, 0.82752323150634766, 0.12418889999389648, 0.079014614224433899, 0.27435758709907532, 0.25825804471969604, 1.3681017160415649, 1.7907644510269165, 0.39580270648002625, 1.4243916273117065, 0.14836907386779785, 0.3069019615650177, 1.4323314428329468, 0.42189797759056091, 0.047193970531225204, 0.47303882241249084, 0.62194353342056274, 0.284626305103302, 1.8536494970321655, 0.73895668983459473]}

Evaluation of saved model:

0.356727153063

Best file:

43_0.19047.hdf5

Evaluation after loading this file:

0.373612910509

Check with validation

Code with validation:

from keras.models import Sequential
from keras.layers import Dense, Flatten, Dropout
from keras.utils import np_utils
from keras.callbacks import ModelCheckpoint
import numpy as np

best_weights_filepath = './best_weights.hdf5'

train_labels = np.array([1, 2]) # 0 - num_classes - 1
y_train = np_utils.to_categorical(train_labels, 3)
X_train = np.array([[[1, 2], [3, 4]], [[2, 1], [4, 3]]])

test_labels = np.array([0, 2]) # 0 - num_classes - 1
y_test = np_utils.to_categorical(test_labels, 3)
X_test = np.array([[[2, 2], [3, 3]], [[1, 1], [4, 4]]])

model = Sequential()
model.add(Flatten(input_shape=X_train.shape[1:]))
model.add(Dropout(0.2))
model.add(Dense(64))
model.add(Dropout(0.15))
model.add(Dense(32))
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='sgd')
mcp = ModelCheckpoint(filepath=best_weights_filepath, monitor='val_loss',
                      save_best_only=True)
hist = model.fit(X_train, y_train, validation_data=(X_test, y_test), nb_epoch=50, callbacks=[mcp], shuffle=False)
print(hist.history)
model.load_weights(best_weights_filepath)
evaluation = model.evaluate(X_test, y_test)
print(evaluation)
print(model.metrics_names)

History:

{'loss': [3.4101266860961914, 2.2727742195129395, 0.82779181003570557, 1.3179346323013306, 1.5904533863067627, 0.60796171426773071, 0.93778908252716064, 1.5920863151550293, 0.9363548755645752, 0.77552896738052368, 0.87378394603729248, 2.1034069061279297, 0.40709391236305237, 0.87646675109863281, 0.072320356965065002, 0.70467042922973633, 0.89934390783309937, 0.26884844899177551, 0.87511622905731201, 0.40567696094512939, 1.6750704050064087, 0.37005302309989929, 0.36293312907218933, 0.94361913204193115, 0.19056390225887299, 1.3764189481735229, 0.25876694917678833, 0.55998247861862183, 1.0649962425231934, 2.1643946170806885, 0.2727261483669281, 1.2005348205566406, 1.0628913640975952, 1.572542667388916, 0.22350168228149414, 0.37423995137214661, 0.7491459846496582, 0.51720428466796875, 0.86196297407150269, 0.72071665525436401, 0.7442132830619812, 0.83153235912322998, 0.045838892459869385, 0.037082117050886154, 0.68096923828125, 0.35572469234466553, 1.4226186275482178, 0.40259963274002075, 0.4162265956401825, 0.29243966937065125], 'val_loss': [2.0877130031585693, 1.3081772327423096, 1.0912094116210937, 1.4002015590667725, 1.1119445562362671, 1.2372562885284424, 1.4829056262969971, 1.3195570707321167, 1.6970505714416504, 1.8137892484664917, 2.6280913352966309, 1.6495449542999268, 1.9247033596038818, 1.8289017677307129, 1.9001308679580688, 1.7850335836410522, 1.903494119644165, 1.8801615238189697, 1.8557041883468628, 1.901431679725647, 2.1235334873199463, 2.1267158985137939, 2.1307065486907959, 2.3799698352813721, 2.6747565269470215, 2.5206508636474609, 2.3310909271240234, 2.6511917114257812, 2.4436931610107422, 2.560744047164917, 2.5082297325134277, 2.3821530342102051, 2.4538085460662842, 2.5820655822753906, 2.5825791358947754, 2.8093762397766113, 2.5358507633209229, 2.4986701011657715, 3.152174711227417, 2.7431669235229492, 2.841381311416626, 2.5363466739654541, 2.5489804744720459, 2.5466430187225342, 2.577369213104248, 2.679440975189209, 2.5890841484069824, 2.7041923999786377, 2.6547081470489502, 2.6690154075622559]}

Evaluation of saved model:

1.09120941162

It looks like it works for validation.

Check with the saved file:

4_2.19177.hdf5

Evaluation after loading this file:

2.19177055359


Solution

  • Because you are monitoring loss, that means the loss on the training data set; the loss on the validation data set is called val_loss. I don't know if this is your actual code, but you should not evaluate on the same dataset that you've trained on. The model might not generalize at all and may simply memorize the input data and overfit terribly, especially on small datasets.

    Why is the evaluation worse than the best saved training loss? Because the loss array is computed at the end of each epoch, and if you have shuffle=True, the batches will be in a different order every epoch, so the gradients will be computed differently. That can account for the difference. Evaluation, on the other hand, processes the whole set at once (in batch_size batches). But again, don't evaluate on the dataset you trained on; you will have a hard time figuring out your network's accuracy that way. A rough way to compare the two numbers directly is sketched below.
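
    To make the comparison concrete, here is a minimal sketch (not from the original post; EvaluateAtEpochEnd is a hypothetical helper, and X_test/y_test are the held-out arrays from the validation example above). It checkpoints on val_loss and re-evaluates the training set with the weights frozen at the end of each epoch, so the value stored in history['loss'] can be compared with what evaluate() reports for the same epoch:

    from keras.callbacks import Callback, ModelCheckpoint

    class EvaluateAtEpochEnd(Callback):
        # Hypothetical helper (not part of Keras): evaluates a fixed dataset
        # with the weights frozen at the end of each epoch, for comparison
        # with the loss Keras stores in history['loss'].
        def __init__(self, X, y):
            super(EvaluateAtEpochEnd, self).__init__()
            self.X = X
            self.y = y

        def on_epoch_end(self, epoch, logs=None):
            logs = logs or {}
            fixed_loss = self.model.evaluate(self.X, self.y, verbose=0)
            print('epoch %d: history loss %.5f, evaluate() loss %.5f'
                  % (epoch, logs.get('loss', float('nan')), fixed_loss))

    # Checkpoint on val_loss computed on held-out data, not on training loss.
    mcp = ModelCheckpoint('./best_weights.hdf5', monitor='val_loss',
                          save_best_only=True)
    hist = model.fit(X_train, y_train, batch_size=32, nb_epoch=50,
                     validation_data=(X_test, y_test),
                     callbacks=[mcp, EvaluateAtEpochEnd(X_train, y_train)],
                     shuffle=False)

    Because the weights no longer change once an epoch ends, the number printed by evaluate() for the checkpointed epoch should match a later model.load_weights(...) followed by model.evaluate(...) on the same data far more closely than the loss recorded in the history.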