Tags: tensorflow, keras, nlp, loss-function

Custom loss in Keras with softmax to one-hot


I have a model that outputs a Softmax, and I would like to develop a custom loss function. The desired behaviour would be:

1) Softmax to one-hot (normally I do numpy.argmax(softmax_vector) and set that index to 1 in a zero vector, but argmax is not differentiable, so this is not allowed inside a loss function).

2) Multiply the resulting one-hot vector by my embedding matrix to get an embedding vector (in my context: the word vector associated with a given word, where words have been tokenized and assigned to indices, which are the classes of the softmax output).

3) Compare this vector with the target (this could be a normal Keras loss function).

I know how to write a custom loss function in general, but not how to do this. I found this closely related question (unanswered), but my case is a bit different, since I would like to preserve my softmax output.
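
In plain NumPy, the behaviour I am after would look roughly like this (illustrative names only, and MSE in step 3 is just an example); it cannot be used as a loss precisely because numpy.argmax is not differentiable:

    import numpy as np

    # Non-differentiable reference for the three steps above.
    def desired_loss(softmax_vector, embedding_matrix, target):
        one_hot = np.zeros_like(softmax_vector)
        one_hot[np.argmax(softmax_vector)] = 1.0           # 1) softmax -> one-hot
        predicted_vector = one_hot @ embedding_matrix      # 2) select the word vector
        return np.mean((predicted_vector - target) ** 2)   # 3) compare with the target (MSE)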


Solution

  • Fan Luo's answer points in the right direction, but ultimately will not work because it involves non-differentiable operations. Note that such operations are acceptable for the true value (a loss function takes a true value and a predicted value; non-differentiable operations are only fine on the true value, since no gradient needs to flow through it).

    To be fair, that is what I was asking for in the first place. It is not possible to do exactly what I wanted, but we can get a similar and differentiable behaviour:

    1) Element-wise power of the softmax values. This makes small values much smaller relative to the largest one. For example, with a power of 4, [0.5, 0.2, 0.7] becomes [0.0625, 0.0016, 0.2401]. Note that 0.2 is of the same order of magnitude as 0.7, but 0.0016 is negligible with respect to 0.2401. The higher my_power is, the closer the final result will be to a one-hot vector.

    from tensorflow.keras.layers import Lambda  # or keras.layers, depending on your setup

    soft_extreme = Lambda(lambda x: x ** my_power)(softmax)
    

    2) Importantly, both softmax and one-hot vectors are normalized, but not our "soft_extreme". First, find the sum along the class axis, keeping the dimension so that the division in 3) broadcasts correctly:

    import tensorflow as tf

    norm = tf.reduce_sum(soft_extreme, axis=1, keepdims=True)  # shape [batch, 1]
    

    3) Normalize soft_extreme:

    almost_one_hot = Lambda(lambda x: x / norm)(soft_extreme)
    

    Note: Setting my_power too high in 1) will result in NaNs. If you need a sharper softmax-to-one-hot conversion, you can apply steps 1 to 3 two or more times in a row.

    4) Finally, we want the vector from the dictionary. A hard lookup (argmax plus indexing) is not differentiable, but we can take a weighted average of the embedding vectors using matrix multiplication. Because our almost_one_hot is close to a one-hot encoding, this average will be close to the vector associated with the highest-scoring class (the originally intended behaviour). The higher my_power is in (1), the truer this will be:

    target_vectors = tf.tensordot(almost_one_hot, embedding_matrix, axes=[[1], [0]])
    

    Note: This will not work directly using batches! In my case, I reshaped my "one hot" from [batch, dictionary_length] to [batch, 1, dictionary_length] using tf.reshape, then tiled my embedding_matrix batch times (see the sketch below) and finally used:

    predicted_vectors = tf.matmul(reshaped_one_hot, tiled_embedding)
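
    For clarity, those intermediate steps look roughly like this (the variable names and the final squeeze are mine; shapes are noted in the comments):

    batch_size = tf.shape(almost_one_hot)[0]
    reshaped_one_hot = tf.reshape(almost_one_hot, [batch_size, 1, -1])    # [batch, 1, dictionary_length]
    tiled_embedding = tf.tile(tf.expand_dims(embedding_matrix, 0),        # [batch, dictionary_length, embedding_dim]
                              [batch_size, 1, 1])
    predicted_vectors = tf.matmul(reshaped_one_hot, tiled_embedding)      # [batch, 1, embedding_dim]
    predicted_vectors = tf.squeeze(predicted_vectors, axis=1)             # [batch, embedding_dim]

    If tiling the full embedding matrix per batch element is too memory-hungry, tf.einsum('bd,de->be', almost_one_hot, embedding_matrix) is one alternative worth trying.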
    

    There may be more elegant solutions (or less memory-hungry, if tiling the embedding matrix is not an option), so feel free to explore more.
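
    Putting it all together, this is roughly how I would wrap everything as a Keras-compatible loss. It is only a sketch under my assumptions (y_true already holds the target word vectors, y_pred is the softmax output, and mean squared error is used for the final comparison), so adapt it to your setup:

    import tensorflow as tf

    def make_embedding_loss(embedding_matrix, my_power=4):
        """embedding_matrix has shape [dictionary_length, embedding_dim]."""
        embedding_matrix = tf.convert_to_tensor(embedding_matrix, dtype=tf.float32)

        def loss(y_true, y_pred):
            # Steps 1-3: sharpen the softmax towards a one-hot and renormalize.
            soft_extreme = y_pred ** my_power
            almost_one_hot = soft_extreme / tf.reduce_sum(soft_extreme, axis=1, keepdims=True)

            # Step 4 (batched): weighted average of the embedding vectors.
            batch_size = tf.shape(y_pred)[0]
            reshaped_one_hot = tf.reshape(almost_one_hot, [batch_size, 1, -1])
            tiled_embedding = tf.tile(tf.expand_dims(embedding_matrix, 0), [batch_size, 1, 1])
            predicted_vectors = tf.squeeze(tf.matmul(reshaped_one_hot, tiled_embedding), axis=1)

            # Compare the predicted word vectors with the targets (any standard loss works here).
            return tf.reduce_mean(tf.square(predicted_vectors - y_true), axis=-1)

        return loss

    Usage would then be along the lines of model.compile(optimizer='adam', loss=make_embedding_loss(embedding_matrix)).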