
Loss function in Keras/TensorFlow


My intent is to implement a custom loss function for training a model in Keras with TensorFlow as backend.

Loss function

Loss function from the LoDNN paper:

$$L = -\frac{1}{N \cdot W \cdot H} \sum_{n=1}^{N} \sum_{i=1}^{W \cdot H} \log\left(p_i^{(n)}\right)$$

W and H are, respectively, the width and height of the softmax layer's output, N is the batch size, and $p_i^{(n)}$ is the probability predicted by the FCN for the correct class at pixel i of sample n.

This loss function is from this paper.

In this implementation, N is 4, W is 200 and H is 400. The output shape of the final layer is (None, 400, 200, 2). A single label's shape is (400, 200, 2) where each channel represents a class.

So far:

A NumPy implementation:

Even though this cannot be used directly to train a model, it expresses exactly what I want the loss function to compute:

import numpy as np

def loss_using_np(y_true, y_pred):
    '''
    Assuming `y_true` and `y_pred` have shape (400, 200, 2).
    This might change to (None, 400, 200, 2) while training in batches?
    '''
    dx = 1e-16  # small constant to avoid log(0) = -infinity
    y_pred = y_pred + dx
    class_one_pred = y_pred[:, :, 0]
    class_two_pred = y_pred[:, :, 1]
    class_one_mask = y_true[:, :, 0] == 1.0
    class_two_mask = y_true[:, :, 1] == 1.0
    class_one_correct_prob_sum = np.sum(np.log(class_one_pred[class_one_mask]))
    class_two_correct_prob_sum = np.sum(np.log(class_two_pred[class_two_mask]))
    N = 4
    H = 400
    W = 200
    return -1 * ((class_one_correct_prob_sum + class_two_correct_prob_sum) / (N * H * W))

The implementation above gives the expected output; unfortunately, it cannot be used to train a model.

y_true = np.random.randint(2, size=(400, 200, 2))
y_pred = np.random.random((400, 200, 2))
loss_using_np(y_true, y_pred)

Failed try 01

import keras
import tensorflow as tf  # is it bad practice not to stick to keras.backend?

def loss_function(y_true, y_pred):
    # Not a working solution: it raises
    # ResourceExhaustedError: OOM when allocating tensor with shape[311146,3,400,2] -- BUT WHY?
    N = 4  # batch size
    W = 200
    H = 400
    dx = 1e-16
    y_pred = tf.add(y_pred, dx)
    class_one_gt = y_true[:, :, :, 0]
    class_one_mask = tf.where(tf.equal(class_one_gt, 1.0))
    # Bad to use `tf.gather` here. It issues the warning
    # `Converting sparse IndexedSlices to a dense Tensor of unknown shape.`
    class_one_prob_sum = keras.backend.sum(keras.backend.log(tf.gather(y_pred[:, :, :, 0], class_one_mask)))
    class_two_gt = y_true[:, :, :, 1]
    class_two_mask = tf.where(tf.equal(class_two_gt, 1.0))
    class_two_prob_sum = keras.backend.sum(keras.backend.log(tf.gather(y_pred[:, :, :, 1], class_two_mask)))
    print("This will be printed only once; it won't be printed every time the loss is calculated. How to log?")
    return -1 * ((class_one_prob_sum + class_two_prob_sum) / (N * W * H))
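
One likely culprit in this attempt: `tf.where` on a boolean tensor returns full N-D coordinates of shape `(num_true, rank)`, which is the index format `tf.gather_nd` expects, not `tf.gather` (which indexes along a single axis, and here produces the huge intermediate tensor seen in the OOM error). A minimal sketch of a `tf.gather_nd` variant, assuming the same shapes and the TF 1.x API (untested against this exact setup):

import tensorflow as tf
from keras import backend as K

def loss_using_gather_nd(y_true, y_pred):
    # `tf.where(condition)` returns the (num_true, rank) coordinates of the
    # True entries, which is exactly what `tf.gather_nd` wants as indices.
    N, H, W = 4, 400, 200
    eps = 1e-16
    indices = tf.where(tf.equal(y_true, 1.0))      # coordinates of correct-class entries
    correct_probs = tf.gather_nd(y_pred, indices)  # 1-D tensor of those probabilities
    return -K.sum(K.log(correct_probs + eps)) / (N * H * W)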

Failed try 02?

def loss_function(y_true, y_pred):
    N = 4
    H = 400
    W = 200
    dx = tf.constant(1e-16, dtype=tf.float32)
    # Select the predicted probabilities at the positions where the
    # one-hot ground truth is 1, i.e. the probabilities of the correct class.
    correct_probs = tf.boolean_mask(y_pred, tf.equal(y_true, 1.0))
    correct_probs = tf.add(correct_probs, dx)
    return (-1 * keras.backend.sum(keras.backend.log(correct_probs))) / (N * H * W)

With this #02 approach I get the following warning:

UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "

Questions

  1. Can you tell me how to implement this loss function without any warnings? I'm not confident that #02 is the right implementation. I'm looking for an optimized solution. Any help or pointers are much appreciated.

  2. I tried to understand what's happening inside loss_function() using print statements, but they are printed only once, when I compile the model. Is there any way to log this?

As mentioned by @dennis-ec, one can use tf.Print() for debugging.
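
In TF 1.x, tf.Print is an identity op with a printing side effect, so it logs on every batch rather than only at graph-construction time. A small illustration of how it could be wired into approach #02 (the logged values here are just examples):

import tensorflow as tf
from keras import backend as K

def loss_function(y_true, y_pred):
    N, H, W = 4, 400, 200
    dx = tf.constant(1e-16, dtype=tf.float32)
    correct_probs = tf.boolean_mask(y_pred, tf.equal(y_true, 1.0))
    # tf.Print returns its first argument unchanged and logs `data`
    # (to stderr) every time this node is evaluated, i.e. once per batch.
    correct_probs = tf.Print(correct_probs,
                             [tf.reduce_min(correct_probs), tf.reduce_max(correct_probs)],
                             message='min/max correct prob: ')
    correct_probs = tf.add(correct_probs, dx)
    return -K.sum(K.log(correct_probs)) / (N * H * W)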

Side note

I'm using Keras 2.1.4 with TensorFlow 1.4.0-rc1 and Python 3.5.2.


Solution

  • To me, it seems like the authors are using a vanilla binary cross-entropy loss for multi-label classification. They also name it as such, but their definition is a bit odd compared to how you would implement it in Keras.

    Basically, you could use binary_crossentropy as a loss function and supply the labels as arrays of shape (400, 200, 1) where a 0 denotes the first class and a 1 denotes the second class. The output of your network would then be of the same shape, with sigmoid activation functions at each output node. This is how semantic segmentation models are usually implemented in Keras. See this repo for an example:

    # final layer, sigmoid activations
    conv10 = Conv2D(1, 1, activation='sigmoid')(conv9)
    model = Model(inputs=inputs, outputs=conv10)
    # binary_crossentropy loss for multi-label classification
    model.compile(optimizer=Adam(lr=1e-4), loss='binary_crossentropy', metrics=['accuracy'])
    

    This should give exactly the same result as the implementation defined in the paper (they probably did not use Keras).
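
    To see why the two formulations agree: with a two-way softmax, the channel-0 probability is 1 minus the channel-1 probability, so the paper's -1/(N·W·H) · Σ log(p_correct) reduces to per-pixel binary cross-entropy on the single-channel labels. A quick NumPy sanity check of that identity (random data, single sample):

    import numpy as np

    # Build one-hot labels and softmax-consistent predictions.
    y_true = np.random.randint(2, size=(400, 200, 2)).astype(float)
    y_true[..., 0] = 1.0 - y_true[..., 1]      # make the two channels a valid one-hot pair
    p1 = np.random.random((400, 200))          # "softmax" probability of the second class
    y_pred = np.stack([1.0 - p1, p1], axis=-1)

    eps = 1e-16
    # Paper-style loss: mean negative log-probability of the correct class.
    paper_loss = -np.mean(np.log(np.sum(y_true * y_pred, axis=-1) + eps))
    # Binary cross-entropy on the collapsed single-channel labels.
    bce_loss = -np.mean(y_true[..., 1] * np.log(p1 + eps)
                        + (1 - y_true[..., 1]) * np.log(1 - p1 + eps))
    assert np.isclose(paper_loss, bce_loss)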