Sorry, I'm new to NLP. Please bear with me. Say I have two sentences:
French: Le chat mange.
English: The cat eats.
In the following text, I will denote a training example as a tuple $(x, y)$, where $x$ is the input data and $y$ is the annotation.
When I train a transformer network, do I A. feed the two sentences in together as a single training example, i.e. (Le chat mange, The cat eats)? Or do I B. use ((Le chat mange, ), The), ((Le chat mange, The), cat), ((Le chat mange, The cat), eats) as training data?
If it's A, it sounds like I would have to wait for the network to produce the output words one by one during training, which would not be parallelizable. So I guess it should be B?
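For concreteness, this is how I picture the training pairs for option B being generated (just a toy sketch of what I mean; the function and variable names are made up):

```python
def option_b_pairs(src, tgt):
    """src = source sentence, tgt = list of target words (toy example)."""
    pairs = []
    for i, word in enumerate(tgt):
        # input = (source sentence, target words produced so far),
        # annotation = the next target word
        pairs.append(((src, tuple(tgt[:i])), word))
    return pairs

print(option_b_pairs("Le chat mange", ["The", "cat", "eats"]))
# [(('Le chat mange', ()), 'The'),
#  (('Le chat mange', ('The',)), 'cat'),
#  (('Le chat mange', ('The', 'cat')), 'eats')]
```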
I figured it out. This progressive "shifting" of the target sentence is achieved by applying the "mask" mentioned in the paper.
The mask is an $N \times N$ lower-triangular matrix, with 1s on and below the diagonal (each position may look at itself and at earlier positions) and 0s above it (the "future" positions):

$$M = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 1 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 1 & 1 & \cdots & 1 \end{bmatrix}$$
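In code, this mask is just a lower-triangular matrix of ones. A minimal NumPy sketch (my own toy code, not taken from the paper):

```python
import numpy as np

def causal_mask(n):
    # Entry [i, j] is 1 if position i may attend to position j (j <= i)
    # and 0 for "future" positions (j > i).
    return np.tril(np.ones((n, n), dtype=int))

print(causal_mask(4))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```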
In self-attention, the matrix $QK^T$ (scaling factor ignored) holds the dot products between the "queries" and the "keys". The mask is applied before the softmax: every entry of $QK^T$ where $M = 0$ is set to $-\infty$ (the "masking out illegal connections" described in the paper), so after the softmax the attention weight between the current query $Q[i,:]$ and any "future" key $K[i+k,:]$, for $k = 1, \dots, N - i$, is exactly zero. As a result, every output position is trained to predict its next word at the same time, which is exactly option B, but computed in parallel in a single forward pass.
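Here is a rough end-to-end sketch of the masked attention itself (again my own toy NumPy code, ignoring the multi-head projections), just to show that the "future" positions really end up with zero weight:

```python
import numpy as np

def masked_self_attention(Q, K, V):
    """Toy single-head causal self-attention on (n, d) arrays."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                # scaled QK^T
    mask = np.tril(np.ones((n, n), dtype=bool))  # True = allowed position
    scores = np.where(mask, scores, -np.inf)     # hide "future" keys
    # row-wise softmax; the -inf entries become exactly 0
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                           # row i mixes only V[0..i]

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(3, 4))
print(masked_self_attention(Q, K, V))
# Changing V[2] would not affect rows 0 and 1 of the output.
```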