# task and model_checkpoint are defined earlier in the notebook
from transformers import AutoModelForSequenceClassification, DistilBertConfig

num_labels = 3 if task.startswith("mnli") else 1 if task == "stsb" else 2

# model1 is built from a bare config with 6 transformer layers
preconfig = DistilBertConfig(n_layers=6, num_labels=num_labels)
model1 = AutoModelForSequenceClassification.from_config(preconfig)
# model2 is built from the pretrained checkpoint
model2 = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)
I am modifying this code (the modified code is provided above) to test different DistilBERT transformer layer depths via from_config, since, as far as I know, from_pretrained uses 6 layers; section 3 of the paper says:

we initialize the student from the teacher by taking one layer out of two

What I want to test is various numbers of layers. To check whether the two functions behave the same, I first ran from_config with n_layers=6, because according to the DistilBertConfig documentation, n_layers determines the transformer block depth. However, when I ran model1 and model2 on the SST-2 dataset, I got the following accuracies:

model1 achieved only 0.8073
model2 achieved 0.901

If both behaved the same, I would expect the results to be similar, but a 10% drop is significant, so I believe there has to be a difference between the two functions. Is there a reason behind this difference (for example, model1 not yet having gone through hyperparameter search), and is there a way to make both functions behave the same? Thank you!
The two functions you described, from_config and from_pretrained, do not behave the same. For a model M with a reference R:

- from_config instantiates a blank model that has the same configuration (the same shape) as your model of choice: M is as R was before training.
- from_pretrained loads a pretrained model that has already been trained on a specific dataset for a given number of epochs: M is as R after training.

To cite the documentation:

Note: Loading a model from its configuration file does not load the model weights. It only affects the model's configuration. Use from_pretrained() to load the model weights.
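As a minimal sketch of this difference (using distilbert-base-uncased as a stand-in for your model_checkpoint), you can check that the two constructors produce the same architecture but different weights:

import torch
from transformers import AutoModelForSequenceClassification, DistilBertConfig

checkpoint = "distilbert-base-uncased"  # stand-in for your model_checkpoint

blank = AutoModelForSequenceClassification.from_config(DistilBertConfig(n_layers=6))
trained = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# Same shapes, different values: from_config initialises randomly,
# from_pretrained loads the checkpoint weights.
w_blank = blank.distilbert.transformer.layer[0].attention.q_lin.weight
w_trained = trained.distilbert.transformer.layer[0].attention.q_lin.weight
print(w_blank.shape == w_trained.shape)    # True
print(torch.allclose(w_blank, w_trained))  # False

This is also why model1 scores so much lower: it is fine-tuned on SST-2 from a random initialisation, without any of the distillation pretraining.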
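If the goal is to test various depths while still starting from the pretrained weights, one option (an assumption on my part, relying on from_pretrained forwarding extra keyword arguments to the config) is to override n_layers when loading the checkpoint; only the first n layers get pretrained weights, and transformers warns about the unused checkpoint weights:

from transformers import AutoModelForSequenceClassification

# Hypothetical depth experiment: a 3-layer student initialised from the
# first 3 layers of the 6-layer pretrained checkpoint.
shallow = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", n_layers=3
)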