Tags: python, tensorflow, machine-learning, keras

Loss increasing to extremely high numbers during training


I'm trying to fit my TensorFlow model on my MacBook Pro (M1). The same model works completely fine on my other system (Ubuntu on WSL2, same Python version), where the loss steadily decreases to around 0.05. On my Mac, however, the loss grows to ridiculously large numbers, at one point reaching 200 trillion. This is not specific to this model; I tried another model as well with the same result.

Here's the code for my model:

import tensorflow as tf
# ModelCheckpoint and Adam are needed for the compile/fit code below
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.optimizers import Adam

model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(16, (6, 6), activation='relu', input_shape=(212, 212, 3)),
    tf.keras.layers.MaxPool2D(2, 2),
    tf.keras.layers.Conv2D(32, (5, 5), activation='relu'),
    tf.keras.layers.MaxPool2D(2, 2),
    tf.keras.layers.Dropout(0.1),
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
    tf.keras.layers.MaxPool2D(2, 2),
    tf.keras.layers.Conv2D(128, (3, 3), activation='relu'),
    tf.keras.layers.MaxPool2D(2, 2),
    tf.keras.layers.Dropout(0.1),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(2, activation='softmax')
])

summary = model.summary()
tf.keras.utils.plot_model(model, to_file="model_plot3.png", show_shapes=True, show_layer_names=True)
checkpoint = ModelCheckpoint("best_model.h5", monitor="val_loss", verbose=0, save_best_only=True, mode="auto")
model.compile(loss="categorical_crossentropy", optimizer=Adam(learning_rate=0.001), metrics=["accuracy"])
hist = model.fit(train_generator, epochs=30, validation_data=valid_generator, callbacks=[checkpoint])

Here's the output:

WARNING:absl:At this time, the v2.11+ optimizer `tf.keras.optimizers.Adam` runs slowly on M1/M2 Macs, please use the legacy Keras optimizer instead, located at `tf.keras.optimizers.legacy.Adam`.
WARNING:absl:There is a known slowdown when using v2.11+ Keras optimizers on M1/M2 Macs. Falling back to the legacy Keras optimizer, i.e., `tf.keras.optimizers.legacy.Adam`.
Epoch 1/30
2023-08-02 15:01:30.003906: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
400/400 [==============================] - ETA: 0s - loss: 0.4036 - accuracy: 0.8142
2023-08-02 15:02:18.532815: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
/Users/williamli/miniconda3/envs/kg2/lib/python3.11/site-packages/keras/src/engine/training.py:3000: UserWarning: You are saving your model as an HDF5 file via `model.save()`. This file format is considered legacy. We recommend using instead the native Keras format, e.g. `model.save('my_model.keras')`.
  saving_api.save_model(
400/400 [==============================] - 65s 161ms/step - loss: 0.4036 - accuracy: 0.8142 - val_loss: 0.6669 - val_accuracy: 0.7453
Epoch 2/30
400/400 [==============================] - 70s 174ms/step - loss: 0.8683 - accuracy: 0.8273 - val_loss: 2.7992 - val_accuracy: 0.7753
Epoch 3/30
400/400 [==============================] - 57s 144ms/step - loss: 609.9819 - accuracy: 0.7615 - val_loss: 2737.7434 - val_accuracy: 0.8334
Epoch 4/30
400/400 [==============================] - 82s 206ms/step - loss: 1593462.3750 - accuracy: 0.6877 - val_loss: 8623920.0000 - val_accuracy: 0.4119
Epoch 5/30
400/400 [==============================] - 83s 206ms/step - loss: 122928104.0000 - accuracy: 0.6355 - val_loss: 40522244.0000 - val_accuracy: 0.8847
Epoch 6/30
400/400 [==============================] - 74s 184ms/step - loss: 1330869888.0000 - accuracy: 0.6237 - val_loss: 982521344.0000 - val_accuracy: 0.8397
Epoch 7/30
400/400 [==============================] - 80s 200ms/step - loss: 7602212864.0000 - accuracy: 0.6055 - val_loss: 7827622912.0000 - val_accuracy: 0.5456
Epoch 8/30
400/400 [==============================] - 81s 202ms/step - loss: 26032189440.0000 - accuracy: 0.6040 - val_loss: 42498854912.0000 - val_accuracy: 0.4313
Epoch 9/30
400/400 [==============================] - 66s 165ms/step - loss: 47032242176.0000 - accuracy: 0.6094 - val_loss: 9544238080.0000 - val_accuracy: 0.8656
Epoch 10/30
400/400 [==============================] - 67s 168ms/step - loss: 128274128896.0000 - accuracy: 0.6000 - val_loss: 208268017664.0000 - val_accuracy: 0.2744
Epoch 11/30
400/400 [==============================] - 59s 146ms/step - loss: 240008052736.0000 - accuracy: 0.5925 - val_loss: 75725373440.0000 - val_accuracy: 0.7259
Epoch 12/30
400/400 [==============================] - 42s 104ms/step - loss: 301967835136.0000 - accuracy: 0.6059 - val_loss: 407859658752.0000 - val_accuracy: 0.4909
Epoch 13/30
400/400 [==============================] - 55s 137ms/step - loss: 486205947904.0000 - accuracy: 0.6050 - val_loss: 1384417722368.0000 - val_accuracy: 0.4391
Epoch 14/30
400/400 [==============================] - 43s 107ms/step - loss: 1128794685440.0000 - accuracy: 0.5770 - val_loss: 735326240768.0000 - val_accuracy: 0.5809
Epoch 15/30
400/400 [==============================] - 44s 109ms/step - loss: 1682992136192.0000 - accuracy: 0.5792 - val_loss: 269883539456.0000 - val_accuracy: 0.8350
Epoch 16/30
400/400 [==============================] - 42s 106ms/step - loss: 1772778946560.0000 - accuracy: 0.5947 - val_loss: 345572409344.0000 - val_accuracy: 0.8394
Epoch 17/30
400/400 [==============================] - 41s 104ms/step - loss: 1817638862848.0000 - accuracy: 0.5984 - val_loss: 3161192660992.0000 - val_accuracy: 0.5013
Epoch 18/30
400/400 [==============================] - 42s 104ms/step - loss: 3075231186944.0000 - accuracy: 0.5902 - val_loss: 367501869056.0000 - val_accuracy: 0.8988
Epoch 19/30
400/400 [==============================] - 42s 104ms/step - loss: 3854084079616.0000 - accuracy: 0.5852 - val_loss: 2003185041408.0000 - val_accuracy: 0.5578
Epoch 20/30
400/400 [==============================] - 42s 104ms/step - loss: 5827094118400.0000 - accuracy: 0.5778 - val_loss: 1107054166016.0000 - val_accuracy: 0.8131
Epoch 21/30
400/400 [==============================] - 42s 104ms/step - loss: 7197869211648.0000 - accuracy: 0.5852 - val_loss: 14864055533568.0000 - val_accuracy: 0.1506
Epoch 22/30
400/400 [==============================] - 42s 104ms/step - loss: 10607804809216.0000 - accuracy: 0.5769 - val_loss: 4783043248128.0000 - val_accuracy: 0.6831
Epoch 23/30
400/400 [==============================] - 42s 104ms/step - loss: 14316861390848.0000 - accuracy: 0.5797 - val_loss: 4704773865472.0000 - val_accuracy: 0.7094
Epoch 24/30
400/400 [==============================] - 42s 104ms/step - loss: 18032501981184.0000 - accuracy: 0.5738 - val_loss: 34486202925056.0000 - val_accuracy: 0.0291
Epoch 25/30
400/400 [==============================] - 42s 104ms/step - loss: 17980863807488.0000 - accuracy: 0.5898 - val_loss: 5609672409088.0000 - val_accuracy: 0.7353
Epoch 26/30
400/400 [==============================] - 42s 104ms/step - loss: 29986731851776.0000 - accuracy: 0.5722 - val_loss: 85421698580480.0000 - val_accuracy: 0.0000e+00
Epoch 27/30
400/400 [==============================] - 42s 104ms/step - loss: 42101488222208.0000 - accuracy: 0.5700 - val_loss: 61102549368832.0000 - val_accuracy: 0.2134
Epoch 28/30
400/400 [==============================] - 42s 104ms/step - loss: 38928790847488.0000 - accuracy: 0.5879 - val_loss: 36724038172672.0000 - val_accuracy: 0.2400
Epoch 29/30
400/400 [==============================] - 42s 104ms/step - loss: 59314614042624.0000 - accuracy: 0.5813 - val_loss: 205399596728320.0000 - val_accuracy: 0.3913
Epoch 30/30
400/400 [==============================] - 42s 104ms/step - loss: 72781760823296.0000 - accuracy: 0.5737 - val_loss: 191339199201280.0000 - val_accuracy: 0.2197
(8141, 39)
(132, 39)
Found 260 validated image filenames belonging to 2 classes.
Test loss: 87099814445056.0
Test accuracy: 0.5230769515037537

I've tried restarting my computer and reinstalling TensorFlow. I also made a pretty simple model on the CIFAR-10 dataset, and the same thing happens there, although not to this extent; a rough sketch of that kind of model is below.
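
For reference, a minimal CIFAR-10 model along those lines would look something like the sketch below. This is illustrative rather than the exact code I ran, and the layer sizes are arbitrary:

# Hypothetical minimal CIFAR-10 sketch -- not the exact model used above.
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0  # scale pixels to [0, 1]

repro = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),
    tf.keras.layers.MaxPool2D(2, 2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])

# Same optimizer/loss family as the model above; CIFAR-10 labels are integers,
# so the sparse variant of the loss is used here.
repro.compile(loss="sparse_categorical_crossentropy",
              optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              metrics=["accuracy"])
repro.fit(x_train, y_train, epochs=5, validation_data=(x_test, y_test))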

Does anyone know why this is happening? I'm on Python 3.11.4, tensorflow 2.13.0, and tensorflow-metal 1.0.1.

I saw this SO post, but I do have MaxPool2D layers, and I'm working on an image classification problem.


Solution

  • I have a similar issue on my Apple Silicon M2 Max. The Keras folks confirmed this is a problem, and it will be fixed in TensorFlow 2.15.

    For now, you can use tf-nightly, which doesn't have the same issue:

    pip install tf-nightly

    https://github.com/keras-team/keras/issues/18370
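
    An interim workaround you could also try is compiling with the legacy Keras Adam optimizer. This is inferred from the absl warning in the training log above rather than from the linked issue, so treat it as a sketch, not a confirmed fix:

    import tensorflow as tf

    # Workaround sketch: use the legacy Adam optimizer mentioned in the absl warning.
    # Whether this avoids the exploding loss on tensorflow-metal is an assumption here.
    model.compile(loss="categorical_crossentropy",
                  optimizer=tf.keras.optimizers.legacy.Adam(learning_rate=0.001),
                  metrics=["accuracy"])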