tensorflow, transformer-model, attention-model, autoregressive-models

Is the TensorFlow multi-head attention layer (e.g. "tfa.layers.MultiHeadAttention") autoregressive?


I looked at the difference between autoregressive and non-autoregressive transformer architectures, but I am wondering whether the attention layer in TensorFlow is actually autoregressive, or whether I need to implement the autoregressive mechanism myself.

I don't see any option for causal masking (e.g. causal=True/False).

I also don't see any documentation that states whether "tfa.layers.MultiHeadAttention" is autoregressive or not.

Any thoughts on that would be appreciated.


Solution

  • I found the solution:

    TensorFlow has a single-head attention layer, tf.keras.layers.Attention, with a boolean causal option (True or False), which was the best fit for my case. The code for the layer is linked below:

    https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/keras/layers/dense_attention.py

    With causal=True, this layer adds a mask such that position i cannot attend to positions j > i. This prevents information from flowing from the future towards the past.

    It can be used as shown below:

    tf.keras.layers.Attention(causal=True, dropout=0.5)
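
To make the masking rule concrete, here is a minimal pure-Python sketch of what a causal mask does to attention weights. It is not the TensorFlow implementation. The helper names (causal_mask, masked_softmax, causal_attention_weights) are made up for illustration, and it assumes self-attention with unscaled dot-product scores and no dropout.

```python
import math

def causal_mask(n):
    """Lower-triangular mask: entry [i][j] is True when position i may attend to j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax that gives exactly zero weight to masked-out positions."""
    weights = []
    for row, allowed in zip(scores, mask):
        # Only allowed positions contribute; subtract the max for numerical stability.
        top = max(s for s, ok in zip(row, allowed) if ok)
        exps = [math.exp(s - top) if ok else 0.0 for s, ok in zip(row, allowed)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

def causal_attention_weights(queries, keys):
    """Dot-product attention weights under a causal mask (assumes len(queries) == len(keys))."""
    scores = [[sum(qd * kd for qd, kd in zip(q, k)) for k in keys] for q in queries]
    return masked_softmax(scores, causal_mask(len(queries)))

# Toy self-attention over a sequence of three 2-dimensional vectors.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = causal_attention_weights(x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

Every entry above the diagonal comes out as zero, so position 0 attends only to itself and position 1 only to positions 0 and 1. Also, as far as I know, newer TensorFlow releases deprecate the causal constructor argument in favor of passing use_causal_mask=True when calling the layer, so check the documentation for your version.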