I have a question concerning the implementation of a correlation-based loss function for a sequence labelling task in Keras (TensorFlow backend).
Suppose we have a sequence labelling problem where the input is a tensor of shape (20,100,5) and the output is a tensor of shape (20,100,1). The documentation says that the loss function needs to return a "scalar for each data point". For the loss between tensors of shape (20,100,1), the default MSE loss returns a loss tensor of shape (20,100).
Now, if we use a loss function based on the correlation coefficient for each sequence, in theory, we will get only a single value for each sequence, i.e., a tensor of shape (20,).
However, when I use this in Keras as a loss function, fit() returns an error because a tensor of shape (20,100) is expected. On the other hand, there is no error when I either return only the mean of the per-sequence correlation losses (a scalar) or repeat each sequence's loss for every sample within the sequence (shape (20,100)). In both cases the framework does not complain (TensorFlow backend), the loss decreases over the epochs, and the performance on independent test data is good.
My questions are: Which output shape is a custom loss function actually supposed to return for such a sequence labelling model? And why does fit() accept a scalar or a (20,100) tensor, but not the per-sequence (20,) tensor?
Please find below an executable example with my implementations of the correlation-based loss functions. my_loss_1 returns only the mean value of the correlation coefficients over all (20) sequences. my_loss_2 returns one loss per sequence (this is the variant that does not work in real training). my_loss_3 repeats each sequence's loss for every sample within the sequence.
Many thanks and best wishes
from keras import backend as K
from keras.losses import mean_squared_error
import numpy as np
import tensorflow as tf
def my_loss_1(seq1, seq2):  # Correlation-based loss - version 1 - returns a scalar
    seq1 = K.squeeze(seq1, axis=-1)
    seq2 = K.squeeze(seq2, axis=-1)
    seq1_mean = K.mean(seq1, axis=-1, keepdims=True)
    seq2_mean = K.mean(seq2, axis=-1, keepdims=True)
    numerator = K.sum((seq1 - seq1_mean) * (seq2 - seq2_mean), axis=-1)
    denominator = K.sqrt(K.sum(K.square(seq1 - seq1_mean), axis=-1) * K.sum(K.square(seq2 - seq2_mean), axis=-1))
    corr = numerator / (denominator + K.epsilon())
    corr_loss = K.constant(1.) - corr
    corr_loss = K.mean(corr_loss)  # average over the batch -> scalar
    return corr_loss
def my_loss_2(seq1, seq2):  # Correlation-based loss - version 2 - returns a 1D array
    seq1 = K.squeeze(seq1, axis=-1)
    seq2 = K.squeeze(seq2, axis=-1)
    seq1_mean = K.mean(seq1, axis=-1, keepdims=True)
    seq2_mean = K.mean(seq2, axis=-1, keepdims=True)
    numerator = K.sum((seq1 - seq1_mean) * (seq2 - seq2_mean), axis=-1)
    denominator = K.sqrt(K.sum(K.square(seq1 - seq1_mean), axis=-1) * K.sum(K.square(seq2 - seq2_mean), axis=-1))
    corr = numerator / (denominator + K.epsilon())
    corr_loss = K.constant(1.) - corr  # one loss per sequence -> shape (batch,)
    return corr_loss
def my_loss_3(seq1, seq2):  # Correlation-based loss - version 3 - returns a 2D array
    seq1 = K.squeeze(seq1, axis=-1)
    seq2 = K.squeeze(seq2, axis=-1)
    seq1_mean = K.mean(seq1, axis=-1, keepdims=True)
    seq2_mean = K.mean(seq2, axis=-1, keepdims=True)
    numerator = K.sum((seq1 - seq1_mean) * (seq2 - seq2_mean), axis=-1)
    denominator = K.sqrt(K.sum(K.square(seq1 - seq1_mean), axis=-1) * K.sum(K.square(seq2 - seq2_mean), axis=-1))
    corr = numerator / (denominator + K.epsilon())
    corr_loss = K.constant(1.) - corr
    corr_loss = K.reshape(corr_loss, (-1, 1))
    # Repeat the per-sequence loss for every timestep. Does not work in fit():
    # K.int_shape() returns None for dimensions that are not statically known.
    corr_loss = K.repeat_elements(corr_loss, K.int_shape(seq1)[1], 1)
    return corr_loss
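# For context (my own note, not part of the original test below): in training,
# these functions would be passed to compile() like any built-in Keras loss,
# e.g. for some hypothetical, already-defined model:
#   model.compile(optimizer='adam', loss=my_loss_3)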
# Test
sess = tf.Session()
# input (20,100,1)
a1 = np.random.rand(20,100,1)
a2 = np.random.rand(20,100,1)
print('\nInput: ' + str(a1.shape))
p1 = K.placeholder(shape=a1.shape, dtype=tf.float32)
p2 = K.placeholder(shape=a2.shape, dtype=tf.float32)
loss0 = mean_squared_error(p1, p2)
print('\nMSE:') # output shape: (20,100)
print(sess.run(loss0, feed_dict={p1: a1, p2: a2}))
loss1 = my_loss_1(p1, p2)
print('\nCorrelation loss (my_loss_1):') # output shape: ()
print(sess.run(loss1, feed_dict={p1: a1, p2: a2}))
loss2 = my_loss_2(p1, p2)
print('\nCorrelation loss (my_loss_2):') # output shape: (20,)
print(sess.run(loss2, feed_dict={p1: a1, p2: a2}))
loss3 = my_loss_3(p1, p2)
print('\nCorrelation loss (my_loss_3):') # output shape: (20,100)
print(sess.run(loss3, feed_dict={p1: a1, p2: a2}))
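# Sanity check (my own addition): averaging the repeated per-sequence losses
# of my_loss_3 over all elements should reproduce the scalar from my_loss_1.
print('\nMean of my_loss_3 output (should match my_loss_1):')
print(np.mean(sess.run(loss3, feed_dict={p1: a1, p2: a2})))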
You wrote: "Now, if we use a loss function based on the correlation coefficient for each sequence, in theory, we will get only a single value for each sequence, i.e., a tensor of shape (20,)."
That's not true. The (Pearson) coefficient is something like
mean((label - mean(label)) * (prediction - mean(prediction))) / (std(label) * std(prediction))
Remove the overall average and you are left with the components of the correlation coefficient, one per element of the sequence, which is exactly the right shape. You can plug in other correlation formulas as well; just stop before reducing to the single value.
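Here is a minimal sketch of that idea, reusing the backend ops from your question (the name my_loss_4 and the use of K.std are my choices; treat it as a starting point, not the one true implementation):

def my_loss_4(seq1, seq2):  # Per-element correlation components - returns a 2D array
    seq1 = K.squeeze(seq1, axis=-1)
    seq2 = K.squeeze(seq2, axis=-1)
    seq1_mean = K.mean(seq1, axis=-1, keepdims=True)
    seq2_mean = K.mean(seq2, axis=-1, keepdims=True)
    # One correlation component per timestep; no sum/mean over the time axis.
    numerator = (seq1 - seq1_mean) * (seq2 - seq2_mean)  # shape (batch, time)
    denominator = K.std(seq1, axis=-1, keepdims=True) * K.std(seq2, axis=-1, keepdims=True)
    corr_elems = numerator / (denominator + K.epsilon())
    return K.constant(1.) - corr_elems  # shape (batch, time), as fit() expects

Averaging this output over the time axis gives 1 minus the per-sequence Pearson coefficient (with the population standard deviation), so Keras's internal mean reduction should end up optimizing the same quantity as my_loss_1 while the loss tensor keeps the (batch, time) shape that fit() expects.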