Tags: tensorflow, keras, loss-function

Custom Loss Function Leads to High MSE and an Offset in the Output (Keras)


I am training a neural network for time-series regression. The model is:

####################################################################################################################
# Define ANN Model
import tensorflow as tf
from tensorflow.keras import layers, backend as K
from tensorflow.keras.layers import (Conv1D, MaxPooling1D, Bidirectional, LSTM,
                                     Reshape, Flatten, Dense)
from tensorflow.keras.models import Model

# define two sets of inputs
acc  = layers.Input(shape=(3, 1,))
gyro = layers.Input(shape=(3, 1,))

# the first branch operates on the first input (accelerometer)
x = Conv1D(256, 1, activation='relu')(acc)
x = Conv1D(128, 1, activation='relu')(x)
x = Conv1D(128, 1, activation='relu')(x)
x = MaxPooling1D(pool_size=3)(x)
x = Model(inputs=acc, outputs=x)

# the second branch operates on the second input (gyroscope)
y = Conv1D(256, 1, activation='relu')(gyro)
y = Conv1D(128, 1, activation='relu')(y)
y = Conv1D(128, 1, activation='relu')(y)
y = MaxPooling1D(pool_size=3)(y)
y = Model(inputs=gyro, outputs=y)

# combine the outputs of the two branches
combined = layers.concatenate([x.output, y.output])

# combined outputs: BiLSTM -> reshape to a sequence -> second BiLSTM
z = Bidirectional(LSTM(128, dropout=0.25, return_sequences=False, activation='tanh'))(combined)
z = Reshape((256, 1))(z)
z = Bidirectional(LSTM(128, dropout=0.25, return_sequences=False, activation='tanh'))(z)

#z = Dense(10, activation="relu")(z)
z = Flatten()(z)
z = Dense(4, activation="linear")(z)

model = Model(inputs=[x.input, y.input], outputs=z)
# `loss` is the custom loss function defined below
model.compile(loss=loss, optimizer=tf.keras.optimizers.Adam(), metrics=['mse'], run_eagerly=True)
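
For context, a minimal sketch of how this model can be exercised with hypothetical random data (shapes inferred from the Input layers above: two (3, 1) inputs and a 4-component quaternion target):

import numpy as np

# hypothetical dummy data: N samples of accelerometer/gyroscope triplets
N = 32
acc_data  = np.random.randn(N, 3, 1).astype('float32')
gyro_data = np.random.randn(N, 3, 1).astype('float32')
quat      = np.random.randn(N, 4).astype('float32')
quat     /= np.linalg.norm(quat, axis=1, keepdims=True)  # unit quaternion targets

model.fit([acc_data, gyro_data], quat, epochs=1, batch_size=8)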

I have tried to implement a custom loss function (based on different papers).

Math

The error is calculated as follows:

y_pred = [w x y z]
y_true = [w1 x1 y1 z1]
error = 2 * acos(w*w1 + x*x1 + y*y1 + z*z1)
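
As a quick illustration (my own toy numbers, not from the papers), this formula gives the geodesic angle between the two rotations represented by the unit quaternions; note that round-off can push the dot product slightly outside [-1, 1], which is why an implementation should clip before calling acos:

import numpy as np

# hypothetical example quaternions (w, x, y, z), both unit-norm
q_pred = np.array([0.7071, 0.7071, 0.0, 0.0])   # 90 deg rotation about x
q_true = np.array([1.0,    0.0,    0.0, 0.0])   # identity rotation

dot = np.dot(q_pred, q_true)                    # = cos(theta / 2)
dot = np.clip(dot, -1.0, 1.0)                   # guard against round-off outside [-1, 1]
error = 2 * np.arccos(dot)                      # geodesic angle between the rotations
print(error)                                    # ~1.5708 rad (90 deg)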

Based on this formula I wrote the custom loss function:

def loss(y_true, y_pred):
    # element-wise product; summing over axis 1 gives the quaternion dot product
    z = y_true * y_pred
    wtot = tf.reduce_sum(z, axis=1)
    # sqrt(wtot^2) = |wtot|; clip to keep acos within its valid domain
    error = 2 * tf.math.acos(K.clip(tf.math.sqrt(wtot * wtot), -1., 1.))
    return error
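
A quick sanity check of this loss on hypothetical toy tensors:

# identical quaternions give 0; rotations 90 deg apart give pi/2
y_true = tf.constant([[1.0, 0.0, 0.0, 0.0],
                      [0.7071, 0.7071, 0.0, 0.0]])
y_pred = tf.constant([[1.0, 0.0, 0.0, 0.0],
                      [1.0, 0.0, 0.0, 0.0]])
print(loss(y_true, y_pred).numpy())  # ~[0.0, 1.5708]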

But while the loss value decreases, the MSE increases, and I can see an offset in the output that grows with the number of epochs. I understand that we do not optimize this network for MSE, but based on the mathematics the MSE should also decrease, or converge to some value near 1.

Orange is the target/reference; blue is the network output.

[Plot: network output vs. target after 1 epoch]

[Plot: network output vs. target after 10 epochs]

[Plot: network output vs. target after 50 epochs]


Solution

  • To solve this problem, I used the quaternion geometric-distance equation to compute the loss value:

    def QQuat_mult(y_true, y_pred):
        """
        Quaternion-product loss.

        Normalizes the predicted quaternion (the network output is not
        guaranteed to be unit-norm), conjugates it, and computes its Hamilton
        product with the ground-truth quaternion. For a perfect prediction
        that product is the identity quaternion [1, 0, 0, 0], so the loss is
        the mean absolute deviation of the product from the identity.

        :param y_true: the ground-truth quaternion (w, x, y, z)
        :param y_pred: the predicted quaternion (w, x, y, z)
        :return: mean absolute deviation of conj(y_pred) * y_true from the
            identity quaternion
        """

        y_pred = tf.linalg.normalize(y_pred, ord='euclidean', axis=1)[0]
        # conjugate the prediction by negating its vector part
        w0, x0, y0, z0 = tf.split(
            tf.multiply(y_pred, [1., -1., -1., -1.]), num_or_size_splits=4, axis=-1)
        w1, x1, y1, z1 = tf.split(y_true, num_or_size_splits=4, axis=-1)

        # Hamilton product conj(y_pred) * y_true, with 1 subtracted from the
        # scalar part so the identity quaternion maps to all zeros
        w = w0*w1 - x0*x1 - y0*y1 - z0*z1
        w = tf.subtract(w, 1)
        x = w0*x1 + x0*w1 + y0*z1 - z0*y1
        y = w0*y1 - x0*z1 + y0*w1 + z0*x1
        z = w0*z1 + x0*y1 - y0*x1 + z0*w1

        loss = tf.abs(tf.concat(values=[w, x, y, z], axis=-1))
        return tf.reduce_mean(loss)
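
    The model is then compiled with this loss in place of the previous one. The following sketch (optimizer and metric mirroring the original compile call) also sanity-checks that a perfect prediction yields zero loss:

    model.compile(loss=QQuat_mult, optimizer=tf.keras.optimizers.Adam(),
                  metrics=['mse'], run_eagerly=True)

    # hypothetical check: identical quaternions give a loss of 0
    q = tf.constant([[1.0, 0.0, 0.0, 0.0]])
    print(QQuat_mult(q, q).numpy())  # ~0.0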