I just finished reading the Transformer paper and the BERT paper, but I couldn't figure out why the Transformer is uni-directional and BERT is bi-directional, as mentioned in the BERT paper. Since they don't use recurrent networks, it's not so straightforward to interpret the directions. Can anyone give me a clue? Thanks.
To clarify, the original Transformer model from Vaswani et al. is an encoder-decoder architecture. The statement "Transformer is uni-directional" is therefore misleading, since it applies only to the decoder.
In fact, the transformer encoder is bi-directional, which means that its self-attention can attend to tokens on both the left and the right of each position. In contrast, the decoder is uni-directional: when generating text one token at a time, the tokens to the right of the current position have not been produced yet, so the decoder must not be allowed to attend to them. The transformer decoder enforces this constraint by masking the tokens to the right in its self-attention.
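To make the masking concrete, here is a minimal sketch in plain Python. It is a toy illustration, not the implementation from either paper: the function name `attention_weights`, the `causal` flag, and the all-zero score matrix are assumptions made up for this example. It applies a row-wise softmax to raw attention scores, once without a mask (encoder-style) and once with the positions to the right of each token hidden (decoder-style).

```python
import math

def attention_weights(scores, causal=False):
    """Row-wise softmax over a square matrix of raw attention scores.

    Hypothetical helper for illustration. With causal=True, the query at
    position i only sees key positions j <= i (nothing to its right).
    """
    weights = []
    for i, row in enumerate(scores):
        # Key positions this query is allowed to attend to.
        visible = [j for j in range(len(row)) if not causal or j <= i]
        # Subtract the row maximum for a numerically stable softmax.
        top = max(row[j] for j in visible)
        exps = {j: math.exp(row[j] - top) for j in visible}
        total = sum(exps.values())
        # Masked positions get exactly zero attention weight.
        weights.append([exps.get(j, 0.0) / total for j in range(len(row))])
    return weights

# Toy scores for a 4-token sequence: every token pair is equally compatible.
scores = [[0.0] * 4 for _ in range(4)]

encoder_weights = attention_weights(scores)               # bi-directional
decoder_weights = attention_weights(scores, causal=True)  # left context only

for name, matrix in (("encoder", encoder_weights), ("decoder", decoder_weights)):
    print(name)
    for row in matrix:
        print("  ", [round(w, 2) for w in row])
```

In the unmasked case every row spreads its weight over all four tokens. In the masked case the first token can only attend to itself, and each later token attends to itself and the tokens on its left, so the weight matrix is lower-triangular.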
BERT uses the transformer encoder architecture and can therefore attend both to the left and right, resulting in "bi-directionality".
From the BERT paper itself:
We note that in the literature the bidirectional Transformer is often referred to as a “Transformer encoder” while the left-context-only version is referred to as a “Transformer decoder” since it can be used for text generation.
Recommended reading: this article.