Sorry, I'm new to NLP. Please bear with me. Say I have two sentences:
French: Le chat mange.
English: The cat eats.
In the following text, I will denote a training example as a tuple $(x, y)$, where $x$ is the input data and $y$ is the annotation.
When I train a transformer network, do I A. feed the two sentences in together as a single training example, i.e. (Le chat mange, The cat eats)? Or do I B. use ((Le chat mange, ), The), ((Le chat mange, The), cat), ((Le chat mange, The cat), eats) as training data?
If it's A, it sounds like I would have to wait for the network to produce the output words one by one during training, which would not be parallelizable. So I guess it should be B?
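For concreteness, this is how I picture the training pairs for option B being generated (just a toy sketch of what I mean; the function and variable names are made up):

```python
def option_b_pairs(src, tgt):
    """src = source sentence, tgt = list of target words (toy example)."""
    pairs = []
    for i, word in enumerate(tgt):
        # input = (source sentence, target words produced so far),
        # annotation = the next target word
        pairs.append(((src, tuple(tgt[:i])), word))
    return pairs

print(option_b_pairs("Le chat mange", ["The", "cat", "eats"]))
# [(('Le chat mange', ()), 'The'),
#  (('Le chat mange', ('The',)), 'cat'),
#  (('Le chat mange', ('The', 'cat')), 'eats')]
```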
I figured it out. This progressive "shifting" of the target sentence is achieved by applying the "mask" mentioned in the paper.
The mask is an $N \times N$ lower-triangular matrix, with 1s on and below the diagonal (each position may look at itself and at earlier positions) and 0s above it (the "future" positions):

$$M = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 1 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 1 & 1 & \cdots & 1 \end{bmatrix}$$
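In code, this mask is just a lower-triangular matrix of ones. A minimal NumPy sketch (my own toy code, not taken from the paper):

```python
import numpy as np

def causal_mask(n):
    # Entry [i, j] is 1 if position i may attend to position j (j <= i)
    # and 0 for "future" positions (j > i).
    return np.tril(np.ones((n, n), dtype=int))

print(causal_mask(4))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```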
In self-attention, the matrix $QK^T$ (scaling factor ignored) holds the dot products between the "queries" and the "keys". The mask is applied before the softmax: every entry of $QK^T$ where $M = 0$ is set to $-\infty$ (the "masking out illegal connections" described in the paper), so after the softmax the attention weight between the current query $Q[i,:]$ and any "future" key $K[i+k,:]$, for $k = 1, \dots, N - i$, is exactly zero. As a result, every output position is trained to predict its next word at the same time, which is exactly option B, but computed in parallel in a single forward pass.
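Here is a rough end-to-end sketch of the masked attention itself (again my own toy NumPy code, ignoring the multi-head projections), just to show that the "future" positions really end up with zero weight:

```python
import numpy as np

def masked_self_attention(Q, K, V):
    """Toy single-head causal self-attention on (n, d) arrays."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                # scaled QK^T
    mask = np.tril(np.ones((n, n), dtype=bool))  # True = allowed position
    scores = np.where(mask, scores, -np.inf)     # hide "future" keys
    # row-wise softmax; the -inf entries become exactly 0
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                           # row i mixes only V[0..i]

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(3, 4))
print(masked_self_attention(Q, K, V))
# Changing V[2] would not affect rows 0 and 1 of the output.
```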