Tags: python, keras, deep-learning, tf.keras, attention-model

How is an attention layer implemented in Keras?


I am learning about attention models and their implementations in Keras. While searching, I came across these two methods (first and second) that can be used to create an attention layer in Keras.

# First method

import tensorflow as tf

class Attention(tf.keras.Model):
    """Bahdanau-style additive attention."""
    def __init__(self, units):
        super(Attention, self).__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, features, hidden):
        # features: (batch, time_steps, feature_dim), hidden: (batch, hidden_size)
        hidden_with_time_axis = tf.expand_dims(hidden, 1)
        # score: (batch, time_steps, units)
        score = tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time_axis))
        # attention_weights: (batch, time_steps, 1), softmax over the time axis
        attention_weights = tf.nn.softmax(self.V(score), axis=1)
        # weighted sum over time -> context_vector: (batch, feature_dim)
        context_vector = attention_weights * features
        context_vector = tf.reduce_sum(context_vector, axis=1)

        return context_vector, attention_weights

# Second method

from tensorflow.keras.layers import (LSTM, Dense, Flatten, Activation,
                                     RepeatVector, Permute, Multiply)

# `embedded` is the output of an Embedding layer: (batch, time_steps, embedding_dim)
activations = LSTM(units, return_sequences=True)(embedded)  # (batch, time_steps, units)

# compute importance for each step
attention = Dense(1, activation='tanh')(activations)        # (batch, time_steps, 1)
attention = Flatten()(attention)                             # (batch, time_steps)
attention = Activation('softmax')(attention)
attention = RepeatVector(units)(attention)                   # (batch, units, time_steps)
attention = Permute([2, 1])(attention)                       # (batch, time_steps, units)

# element-wise multiply; in Keras 1.x this was merge([...], mode='mul')
sent_representation = Multiply()([activations, attention])

The math behind the attention model (as implemented in the first snippet) is:

    score             = V(tanh(W1(features) + W2(hidden)))
    attention_weights = softmax(score)              # over the time-step axis
    context_vector    = sum_t attention_weights_t * features_t

If we look at the first method, it is a fairly direct implementation of the attention math, whereas the second method, which gets far more hits on the internet, is not.

My real doubt is about these lines in the second method:

attention = RepeatVector(units)(attention)
attention = Permute([2, 1])(attention)
sent_representation = Multiply()([activations, attention])
  • Which is the right implementation for attention?
  • What is the intuition behind the RepeatVector and Permute layers in the second method?
  • In the first method W1 and W2 are weights; why is a Dense layer considered to be the weights here?
  • Why is V a single-unit Dense layer?
  • What does V(score) do?

Solution

  • Which is the right implementation for attention?

    I'd recommend the following:

    https://github.com/tensorflow/models/blob/master/official/transformer/model/attention_layer.py#L24

    The multi-head Attention layer above implements a nifty trick: it reshapes the tensor so that instead of being shaped as (batch_size, time_steps, features) it is shaped as (batch_size, heads, time_steps, features / heads), and then it performs the attention computation on each "features / heads" block.
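
    A minimal sketch of that trick, assuming TF 2.x (the helper name split_heads and the example shapes are mine, not the exact code from the linked file):

    import tensorflow as tf

    def split_heads(x, num_heads):
        # (batch, time_steps, features) -> (batch, heads, time_steps, features / heads)
        batch_size = tf.shape(x)[0]
        time_steps = tf.shape(x)[1]
        depth = x.shape[-1] // num_heads          # features must be divisible by heads
        x = tf.reshape(x, (batch_size, time_steps, num_heads, depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    # each head then attends over its own "features / heads" slice
    x = tf.random.normal((2, 5, 16))              # (batch, time_steps, features)
    print(split_heads(x, num_heads=4).shape)      # (2, 4, 5, 4)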

  • What is the intuition behind the RepeatVector and Permute layers in the second method?

    Your code is incomplete: the matrix multiplication / weighted sum over time steps is missing (you don't show the attention output actually being used). RepeatVector and Permute just reshape the softmax weights from (batch_size, time_steps) to (batch_size, time_steps, units) so they can be multiplied element-wise with the LSTM activations; in other words, the code is recovering the right shape by hand. It is probably not the best approach.
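
    To see the shape bookkeeping those two layers do, here is a small self-contained sketch of the second method (tf.keras names; the input size, time_steps and units are made up):

    import tensorflow as tf
    from tensorflow.keras.layers import (Input, LSTM, Dense, Flatten, Activation,
                                         RepeatVector, Permute, Multiply)

    time_steps, units = 10, 32
    inp = Input(shape=(time_steps, 8))
    activations = LSTM(units, return_sequences=True)(inp)      # (batch, time_steps, units)

    attention = Dense(1, activation='tanh')(activations)       # (batch, time_steps, 1)
    attention = Flatten()(attention)                            # (batch, time_steps)
    attention = Activation('softmax')(attention)                # one weight per time step
    attention = RepeatVector(units)(attention)                  # (batch, units, time_steps)
    attention = Permute([2, 1])(attention)                      # (batch, time_steps, units)

    # the weights are now broadcast over every feature, so the element-wise
    # multiply scales each time step of the LSTM output by its attention weight
    sent_representation = Multiply()([activations, attention])  # (batch, time_steps, units)
    # a weighted sum over the time axis would still be needed to get a
    # (batch, units) context vector, which is the step missing in the post

    tf.keras.Model(inp, sent_representation).summary()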

  • In the first method W1 and W2 are weights; why is a Dense layer considered to be the weights here?

    A Dense layer is a set of weights (a kernel matrix, plus an optional bias)... Your question is a bit vague.
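
    One way to convince yourself, assuming a bias-free Dense layer: the layer's kernel is the weight matrix, and calling the layer is the matrix multiplication (illustrative only):

    import numpy as np
    import tensorflow as tf

    W1 = tf.keras.layers.Dense(4, use_bias=False)   # plays the role of the matrix W1
    x = tf.random.normal((1, 3))
    y = W1(x)                                       # builds the layer and applies it

    # same result as an explicit matrix multiplication with the layer's kernel
    np.testing.assert_allclose(y.numpy(), x.numpy() @ W1.kernel.numpy(),
                               rtol=1e-5, atol=1e-6)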

  • Why is V a single-unit Dense layer?

    That is a very odd choice that matches neither my reading of the paper nor the implementations I've seen.
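
    Mechanically, though, Dense(1) in the first snippet just collapses the units-dimensional score of each time step to a single scalar, so the softmax over axis=1 produces one weight per time step. A sketch with made-up shapes:

    import tensorflow as tf

    batch, time_steps, units = 2, 5, 8
    score = tf.random.normal((batch, time_steps, units))   # tanh(W1(features) + W2(hidden))

    V = tf.keras.layers.Dense(1)                            # single-unit projection
    logits = V(score)                                       # (2, 5, 1): one scalar per time step
    attention_weights = tf.nn.softmax(logits, axis=1)       # normalised over the time axis

    print(logits.shape)
    print(tf.reduce_sum(attention_weights, axis=1))         # ~1.0 for every example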