nlp · word-embedding · machine-translation · attention-model

Why are no word embeddings (GloVe, word2vec, etc.) used in the first attention paper?


In the paper Neural Machine Translation by Jointly Learning to Align and Translate by Bahdanau et al., why are no word embeddings such as GloVe or word2vec used?

I understand that this was a 2014 paper, but the current implementations of the paper on GitHub don't use any word embeddings either.

If I try to implement the paper myself, is it reasonable to use word embeddings?


Solution

  • In short: the model certainly does use word embeddings, they are just not pre-trained embeddings like GloVe or word2vec. Instead, the embeddings are randomly initialised and jointly trained along with the rest of the network (see the first sketch after this answer).

    In the full description of the network in section A.2 of the original Bahdanau et al. paper, you'll see the word embedding matrices E described for both the encoder and decoder. How they were initialised is also described in section B.1.

    This usually works as well as or better than pre-trained embeddings when you have enough data. That said, in a low-resource setting it can help to initialise the embedding matrix with pre-trained embeddings; the second sketch below shows one way to do that. This paper might help you explore that idea in further detail.

    In addition, your statement that current implementations don't do this is not entirely accurate. While it's true that the embeddings are usually jointly trained by default, many existing neural MT toolkits offer the option to initialise the embeddings with pre-trained vectors, for example OpenNMT-py and Marian.
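
A minimal PyTorch sketch (not the authors' original code) of the point above: the embedding matrix E is just another randomly initialised layer whose weights are trained by backpropagation together with the encoder RNN. The sizes follow the paper's reported settings (620-dimensional embeddings, 1000 hidden units); the vocabulary size is an arbitrary example value.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=620, hidden_dim=1000):
        super().__init__()
        # Plays the role of the embedding matrix E from appendix A.2:
        # randomly initialised, updated by backprop like any other parameter.
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, bidirectional=True, batch_first=True)

    def forward(self, src_tokens):             # src_tokens: (batch, src_len) token ids
        embedded = self.embedding(src_tokens)  # (batch, src_len, emb_dim)
        outputs, hidden = self.rnn(embedded)   # annotations used by the attention
        return outputs, hidden

encoder = Encoder()
# The embedding weights are ordinary parameters, so any optimiser built over
# encoder.parameters() trains them jointly with the rest of the network.
print(any(p is encoder.embedding.weight for p in encoder.parameters()))  # True
```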
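
And a hedged sketch of the low-resource alternative mentioned above: copy pre-trained vectors (e.g. GloVe) into the embedding matrix before training, then fine-tune it with the rest of the model. Here `pretrained` and `word2id` are stand-ins for whatever embedding file and vocabulary mapping you actually load.

```python
import torch
import torch.nn as nn

vocab_size, emb_dim = 10000, 300
embedding = nn.Embedding(vocab_size, emb_dim)  # still randomly initialised at first

# Placeholders: in practice these would come from a GloVe/word2vec file
# and from your corpus vocabulary.
pretrained = {"the": torch.randn(emb_dim), "cat": torch.randn(emb_dim)}
word2id = {"the": 2, "cat": 3}

# Overwrite the rows for which a pre-trained vector exists; words not covered
# keep their random initialisation.
with torch.no_grad():
    for word, idx in word2id.items():
        if word in pretrained:
            embedding.weight[idx] = pretrained[word]

# Set embedding.weight.requires_grad = False if you want to freeze the
# vectors instead of fine-tuning them during training.
```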