I know I can load the smallest GPT2 variant using
from transformers import AutoTokenizer, GPT2LMHeadModel, AutoConfig

# assumed setup (not shown above): any tokenizer and context length will do;
# the exact parameter count printed below depends on the tokenizer's vocabulary size
tokenizer = AutoTokenizer.from_pretrained("gpt2")
context_length = 1024  # example value; use whatever context length you train with

config = AutoConfig.from_pretrained(
    "gpt2",
    vocab_size=len(tokenizer),
    n_ctx=context_length,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
model = GPT2LMHeadModel(config)
model_size = sum(t.numel() for t in model.parameters())
print(f"GPT-2 size: {model_size/1000**2:.1f}M parameters")
>>> GPT-2 size: 124.2M parameters
But how can I load a GPT2 architecture with a smaller number of decoder layers? Say, 3 or 5 instead of the original (I think it's 12)? Note that I'm training this from scratch so I'm not looking for an already pretrained model.
To stack 3 or 5 decoder layers instead of the 12 that gpt2 has by default, it is sufficient to pass n_layer=3 (or n_layer=5) as an additional keyword argument to the .from_pretrained() method of the AutoConfig class (GPT2Config under the hood).
config = AutoConfig.from_pretrained(
    "gpt2",
    vocab_size=len(tokenizer),
    n_ctx=context_length,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    n_layer=3,
)
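You can then build the smaller model from this config exactly as in the question. A quick sanity check (a minimal sketch, assuming the same tokenizer and context_length as above; the exact count depends on len(tokenizer)):

model = GPT2LMHeadModel(config)  # randomly initialized, not pretrained
model_size = sum(t.numel() for t in model.parameters())
print(f"3-layer GPT-2 size: {model_size/1000**2:.1f}M parameters")
# roughly 60M with a GPT-2-sized vocabulary, since each dropped decoder block removes ~7M parameters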
Alternatively, you can pass num_hidden_layers=3 or num_hidden_layers=5. Thanks to https://github.com/huggingface/transformers/pull/13026, the two names are interchangeable.
config = AutoConfig.from_pretrained(
    "gpt2",
    vocab_size=len(tokenizer),
    n_ctx=context_length,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    num_hidden_layers=3,
)
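Either way, the resulting config is the same, since num_hidden_layers is aliased to n_layer under the hood. As a quick check (assuming the config built above), both attributes report the same value and the instantiated model really has 3 decoder blocks:

print(config.n_layer)            # 3
print(config.num_hidden_layers)  # 3 -- alias of n_layer
model = GPT2LMHeadModel(config)
print(len(model.transformer.h))  # 3 decoder blocks in the stack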