Tags: python, tensorflow, keras, tensor, activation

Understanding the 'axis' parameter in the softmax activation function


Suppose I have an input tensor carrying one embedded word per timestep. For example, with a time window of 5 and a word-embedding vector width of 64, I get the shape:

(None, 5, 64, 1)

I apply 4 filters with a kernel shape of (1, 64) to look for specific words at each timestep. Each filter produces one value per timestep denoting "word/meaning exists" or "word/meaning does not exist", giving an output tensor of shape:

(None, 5, 1, 4)

How do I define the 'axis' parameter of the softmax layer so that the outputs of all convolutions per timestep are normalized, as in a classification task?

More specifically, I want the output to look like the following (height is time, width is channels):

[[[.1, .4, .4, .1]]
 [[.9,  0,  0, .1]]
 [[.8,  0, .1, .1]]
 [[ 0,  1,  0,  0]]
 [[.6, .1, .1, .2]]]

I.e., the components of each row/timestep add up to one; the softmax should only normalize rows.

Code snippet:

model.add(layers.Conv2D(
    filters=words_of_interest,
    kernel_size=(1, embedding_length),
    strides=(1, embedding_length),
    padding="same",
))
model.add(layers.Softmax(axis=3)) # <- is this correct for what i described above?
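
For context, here is a minimal runnable version of the setup above (assuming words_of_interest = 4, embedding_length = 64, and a 5-step window as described; the random input is just a stand-in for real embeddings):

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

words_of_interest, embedding_length, timesteps = 4, 64, 5

model = models.Sequential([
    layers.Input(shape=(timesteps, embedding_length, 1)),
    layers.Conv2D(
        filters=words_of_interest,
        kernel_size=(1, embedding_length),
        strides=(1, embedding_length),
        padding="same",
    ),
    layers.Softmax(axis=3),
])

out = model.predict(np.random.random((1, timesteps, embedding_length, 1)))
print(out.shape)        # (1, 5, 1, 4)
print(out.sum(axis=3))  # each timestep's 4 channel values sum to 1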

Solution

  • Softmax behaves the same for any value of axis: TensorFlow normalizes along whichever axis you pass, so the values along that axis sum to one. For your (None, 5, 1, 4) output, axis=3 is the channel axis, so layers.Softmax(axis=3) normalizes the four filter outputs per timestep, exactly as you describe. You can check the difference between the raw tensor and the axis-wise normalized versions with the code below.

    import numpy as np
    import tensorflow as tf

    # Dummy tensor with the conv output shape from the question: (batch, time, 1, channels)
    array = np.random.random((2, 5, 1, 4))
    tensor = tf.convert_to_tensor(array)

    norm0 = tf.keras.activations.softmax(tensor, axis=0)
    norm1 = tf.keras.activations.softmax(tensor, axis=1)
    norm2 = tf.keras.activations.softmax(tensor, axis=2)
    norm3 = tf.keras.activations.softmax(tensor, axis=3)

    print(tf.reduce_sum(tensor, axis=3))  # raw values: channels do not sum to 1
    print(tf.reduce_sum(norm0, axis=0))   # all ones over the batch axis
    print(tf.reduce_sum(norm1, axis=1))   # all ones over the time axis
    print(tf.reduce_sum(norm2, axis=2))   # all ones over the (size-1) width axis
    print(tf.reduce_sum(norm3, axis=3))   # all ones per timestep: the desired behaviour
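
    As a side note, axis=-1 (the Keras default) addresses the last axis, so for this 4-D tensor it is equivalent to axis=3:

    # axis=-1 is the last axis, i.e. axis=3 for this 4-D tensor
    norm_last = tf.keras.activations.softmax(tensor, axis=-1)
    print(np.allclose(norm_last.numpy(), norm3.numpy()))  # True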