bert-language-model · transformer-model · large-language-model

Doubts regarding ELECTRA Paper Implementation


I am a master's student currently studying NLP. I was reading the ELECTRA paper by Clark et al., and I have a few doubts regarding the implementation and training.

I was wondering if you could help me with those.

  1. What exactly does "step" mean in the step count? Does it refer to 1 epoch or 1 minibatch?
  2. In Table 1 of the paper, ELECTRA-Small and BERT-Small both have 14M parameters. How is that possible, given that ELECTRA should have more parameters, since its generator and discriminator modules are both BERT-based?
  3. What is the architecture of the generator and the discriminator? Are they both BERT, or something else?
  4. There is a sampling step between the generator and the discriminator. How are the gradients backpropagated through this?

Thanks in advance

Well, I tried looking online for answers, but they were not conclusive. Regarding backpropagating the gradients, I think the gradients in the discriminator are not backpropagated to the generator; in that sense the two are trained separately, although the generator's output for the current step is fed as input to the discriminator.
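For intuition, here is a minimal PyTorch-style sketch (not the paper's code; the tensor names and sizes are made up) of why the discrete sampling step blocks gradients from flowing back into the generator:

```python
import torch

vocab_size = 30522
# Stand-in for the generator's output logits at one masked position.
gen_logits = torch.randn(vocab_size, requires_grad=True)

# Sampling a token id is a discrete operation: the result is an integer index,
# so autograd has no gradient to propagate through it.
probs = torch.softmax(gen_logits, dim=-1)
sampled_id = torch.multinomial(probs, num_samples=1)  # LongTensor, no grad history

print(sampled_id.requires_grad)  # False -> a discriminator loss computed on this
                                 # input cannot reach gen_logits
```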


Solution

  • Most of these can be answered from the paper itself.

    1. Yes, a step means 1 minibatch. Basically, every time optimizer.step() is called, it counts as 1 step (see the training-loop sketch after this list).
    2. The 14M figure counts only the discriminator, which is the model kept for fine-tuning; the generator is discarded after pre-training.
    3. Any Transformer encoder can be used; in their implementation both are BERT-style encoders, with the generator typically smaller than the discriminator.
    4. They (discriminator and generator) are trained jointly: a sample is passed through the generator and then through the discriminator, the corresponding losses are added, and the combined loss is backpropagated. The discrete sampling step is not differentiated through, so the discriminator's loss does not update the generator (see the sketch below).
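To make points 1 and 4 concrete, here is a minimal sketch of an ELECTRA-style joint pre-training loop in PyTorch. The tiny generator and discriminator stand-ins, the 15% masking rate, and the batch shapes are illustrative assumptions, not the paper's actual code; only the overall structure (masked-LM loss plus weighted replaced-token-detection loss, one optimizer update per minibatch) follows the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins for the two models; the paper uses BERT-style Transformer encoders.
vocab_size, hidden, mask_id = 1000, 64, 0
generator = nn.Sequential(nn.Embedding(vocab_size, hidden), nn.Linear(hidden, vocab_size))
discriminator = nn.Sequential(nn.Embedding(vocab_size, hidden), nn.Linear(hidden, 1))

# One optimizer over both models: they are trained jointly on a combined loss.
params = list(generator.parameters()) + list(discriminator.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)

step = 0
for _ in range(3):                                     # pretend data loader: 3 minibatches
    tokens = torch.randint(1, vocab_size, (8, 16))     # fake batch of token ids (B, T)
    mask = torch.rand(tokens.shape) < 0.15             # positions chosen for masking

    # 1) Generator: masked language modelling loss on the masked positions.
    masked_input = tokens.clone()
    masked_input[mask] = mask_id                       # pretend [MASK] token
    gen_logits = generator(masked_input)               # (B, T, V)
    mlm_loss = F.cross_entropy(gen_logits[mask], tokens[mask])

    # 2) Sample replacement tokens. Sampling is discrete, so the discriminator's
    #    gradients cannot flow back into the generator through these ids.
    with torch.no_grad():
        sampled = torch.distributions.Categorical(logits=gen_logits[mask]).sample()
    corrupted = tokens.clone()
    corrupted[mask] = sampled

    # 3) Discriminator: predict, for every position, whether it was replaced.
    #    Sampled tokens that happen to equal the original count as "original".
    is_replaced = (corrupted != tokens).float()
    disc_logits = discriminator(corrupted).squeeze(-1) # (B, T)
    disc_loss = F.binary_cross_entropy_with_logits(disc_logits, is_replaced)

    # 4) Combined loss, one backward pass, one optimizer update = one "step".
    loss = mlm_loss + 50.0 * disc_loss                 # the paper weights the disc. loss (lambda = 50)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    step += 1                                          # the paper's step counts refer to these updates
```

Because of step 2, the MLM loss is the only term that updates the generator, while the detection loss only updates the discriminator, even though both are optimized together in a single backward pass.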