Tags: python, tensorflow, neural-network, loss, tf-slim

Tensorflow neural network loss value NaN


I'm trying to build a simple multilayer perceptron model on a large data set, but I'm getting NaN as the loss value. The weird thing is that after the first training step the loss value is not NaN and is about 46 (which is oddly low; when I run a logistic regression model on the same data, the first loss value is about ~3600). Right after that, though, the loss value is constantly NaN. I used tf.Print to try to debug it as well.

The goal of the model is to predict ~4500 different classes, so it's a classification problem. Using tf.Print, I see that after the first training step (i.e. the first forward pass through the MLP), the predictions coming out of the last fully connected layer look reasonable (the argmax values vary across the roughly 4500 classes). After that, however, the outputs of the last fully connected layer collapse to all 0's or some other constant value (0 0 0 0 0).
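To narrow down where the NaNs first show up, one option in TensorFlow 1.x (which TF-Slim implies) is to wrap suspect tensors in tf.check_numerics, which raises an error the moment a value becomes NaN or Inf. A minimal sketch, with placeholder logits and labels standing in for the real model:

import tensorflow as tf

# Placeholder tensors only; shapes mirror the batch size (1000) and class count (~4500).
logits = tf.random_normal([1000, 4500])
labels = tf.one_hot(tf.zeros([1000], dtype=tf.int32), 4500)
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits))

# Raises an InvalidArgumentError as soon as `loss` contains NaN or Inf,
# which pinpoints the step where things go wrong.
checked_loss = tf.check_numerics(loss, message="loss became NaN/Inf")

with tf.Session() as sess:
    print(sess.run(checked_loss))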

For some information about my model:

  • 3-layer model; all fully connected layers.

  • Batch size of 1000.

  • Learning rate of 0.001 (I also tried 0.1 and 0.01, but nothing changed).

  • Using cross-entropy loss (I did add an epsilon value to prevent log(0)).

  • Using AdamOptimizer.

  • Learning rate decay of 0.95.

The exact code for the model is below: (I'm using the TF-Slim library)

# Model body, built with TF-Slim fully connected layers.
input_layer = slim.fully_connected(model_input, 5000, activation_fn=tf.nn.relu)
hidden_layer = slim.fully_connected(input_layer, 5000, activation_fn=tf.nn.relu)
output = slim.fully_connected(hidden_layer, vocab_size, activation_fn=tf.nn.relu)
# Print the argmax over classes for the first 10 steps to debug the predictions.
output = tf.Print(output, [tf.argmax(output, 1)], 'out = ', summarize=20, first_n=10)
return {"predictions": output}

Any help would be greatly appreciated! Thank you so much!


Solution

  • Two (possibly more) reasons why it doesn't work:

    1. You skipped, or applied inappropriately, feature scaling of your inputs and outputs. As a result, the data may be numerically difficult for TensorFlow to handle.
    2. ReLU is not differentiable at zero and can lead to dead or exploding activations, which may cause issues here. Try other activation functions, such as tanh or sigmoid (see the sketch below).
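
A minimal sketch of both suggestions, assuming the raw features live in a NumPy array (the shapes, sizes, and variable names below are placeholders, not taken from the question):

import numpy as np
import tensorflow as tf
import tensorflow.contrib.slim as slim

# 1) Feature scaling: standardize the training features once and reuse the
#    same mean/std at evaluation time.
train_features = np.random.rand(10000, 128).astype(np.float32)   # placeholder data
mean = train_features.mean(axis=0)
std = train_features.std(axis=0) + 1e-8                          # avoid division by zero
scaled_features = (train_features - mean) / std

# 2) Swap ReLU for tanh in the hidden layers, and leave the final layer linear
#    so it produces unbounded logits for the ~4500 classes.
vocab_size = 4500
model_input = tf.placeholder(tf.float32, [None, 128])
hidden_1 = slim.fully_connected(model_input, 5000, activation_fn=tf.nn.tanh)
hidden_2 = slim.fully_connected(hidden_1, 5000, activation_fn=tf.nn.tanh)
logits = slim.fully_connected(hidden_2, vocab_size, activation_fn=None)

Keeping the output layer linear and applying the softmax inside the loss is a common way to avoid the all-zero outputs described in the question, since a ReLU on the final layer clips every negative logit to 0.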