Tags: tensorflow, keras, padding, masking, attention-model

Effect of padding sequences in MultiHeadAttention (TensorFlow/Keras)


I am trying to use the MultiHeadAttention layer to process variable-length sets of elements, that is, sequences where the order is not important (otherwise I would try RNNs). The problem is that I'm not sure I understand the effect of padding in the input sequence. My point is that the output for a sequence containing elements 1 and 2 should be equal to the output for the same sequence padded with 0's to a given length. In other words, the inputs [1, 2] and [1, 2, 0] (or even [1, 2, 0, 0, 0, ...]) should yield the same output for the true inputs (1 and 2; I don't mind the output for the 0s, because I know they are "fake" inputs used only for padding). The following piece of code shows the different outputs depending on the padding.

import tensorflow as tf
import numpy as np

max_tokens = 10  # size of the vocabulary (used as the Embedding input_dim)
dimension = 5  # dimension of the vectors in the embedding

# Variable-length int sequences.
query_input = tf.keras.layers.Input(shape=(None,), dtype='int32')
value_input = tf.keras.layers.Input(shape=(None,), dtype='int32')

handmade_embedding = np.arange(max_tokens).reshape(max_tokens, 1) * np.ones(dimension)

# Embedding lookup.
token_embedding = tf.keras.layers.Embedding(input_dim=max_tokens, output_dim=dimension, mask_zero=True,
                                            embeddings_initializer=tf.constant_initializer(handmade_embedding),
                                            trainable=False)

# Query embeddings of shape [batch_size, Tq, dimension].
query_embeddings = token_embedding(query_input)
# Value embeddings of shape [batch_size, Tv, dimension].
value_embeddings = token_embedding(value_input)

attention_output, weights = \
    tf.keras.layers.MultiHeadAttention(num_heads=10, key_dim=10)(query=query_embeddings,
                                                                 value=value_embeddings,
                                                                 return_attention_scores=True)

model = tf.keras.Model(inputs=[query_input, value_input],
                       outputs=[query_embeddings, attention_output])
names = ('query_embeddings', 'attention_output')

model.summary()

q = np.array([[1, 2, 0]])
prediction = model.predict([q, q])  # self-attention

print('\nWITH PADDING')
for n, v in zip(names, prediction):
    print(f'\n{n}:\n{v}')

q = q[:, :-1]  # remove the padding column in this example
prediction = model.predict([q, q])  # self-attention
print('\nWITHOUT PADDING')
for n, v in zip(names, prediction):
    print(f'\n{n}:\n{v}')

The output of the MultiHeadAttention layer with padding is the following:

attention_output:
[[[-0.0374077  -0.03303239 -0.02354158 -0.04111823  0.08189851]
  [-0.04877335 -0.04348412 -0.012391   -0.04778382  0.09745573]
  [-0.02586985 -0.02244503 -0.03482261 -0.03429744  0.06620502]]]

and without padding:

attention_output:
[[[-0.04313684 -0.03764199 -0.04799934 -0.05400878  0.10519686]
  [-0.04743624 -0.041591   -0.04378954 -0.05654225  0.11106053]]]

I expected the first and second output vectors to be the same, but that is not the case. I plan to process those vectors later and summarize them into a single vector (average, or whatever), but I would like the outputs to be deterministic with respect to the padding length. What am I misunderstanding?
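
To be concrete, the check I would like to pass is something like the following sketch (reusing the model defined above and comparing only the rows that correspond to the true tokens):

q_padded = np.array([[1, 2, 0]])
q_unpadded = np.array([[1, 2]])

_, out_padded = model.predict([q_padded, q_padded])        # self-attention, with padding
_, out_unpadded = model.predict([q_unpadded, q_unpadded])  # self-attention, without padding

# Only positions 0 and 1 correspond to the true tokens 1 and 2; the third row
# of the padded output belongs to the "fake" 0 and is ignored in the comparison.
print(np.allclose(out_padded[:, :2, :], out_unpadded, atol=1e-6))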


Solution

  • Well, after letting the code rest on my computer for some months, it now seems that the attention_mask is not even needed. The output is now what I expected, i.e., the same for the true entries. Maybe some internal changes in TensorFlow affected this. I'm going a bit crazy...
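
  • In case it helps on older TensorFlow versions, where the Keras padding mask was not picked up automatically by MultiHeadAttention, the attention mask can also be built and passed explicitly. Below is a minimal sketch, assuming the query_input, value_input, token_embedding, query_embeddings and value_embeddings defined in the question; note it creates a new MultiHeadAttention instance with its own weights, so it illustrates the call rather than being a drop-in for the exact model above:

import tensorflow as tf

# Padding masks from the Embedding layer: True for real tokens, False for the 0s.
query_mask = token_embedding.compute_mask(query_input)   # shape (batch, Tq)
value_mask = token_embedding.compute_mask(value_input)   # shape (batch, Tv)

# Combine them into the boolean (batch, Tq, Tv) mask expected by MultiHeadAttention.
attention_mask = tf.logical_and(query_mask[:, :, tf.newaxis],
                                value_mask[:, tf.newaxis, :])

attention_output, weights = tf.keras.layers.MultiHeadAttention(
    num_heads=10, key_dim=10)(query=query_embeddings,
                              value=value_embeddings,
                              attention_mask=attention_mask,
                              return_attention_scores=True)

    With the mask passed explicitly, the rows for the true tokens should come out the same whether or not the sequences are padded with zeros (the rows at the padded positions can still differ, but those are ignored anyway).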