python huggingface-transformers large-language-model inference pre-trained-model

Saving a fine-tuned Falcon HuggingFace LLM model

I'm trying to save my fine-tuned model so that I don't have to re-download the base model every time I want to use it, but nothing seems to work. I would appreciate your help.

The following parameters are used for the training:

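import torch
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from trl import SFTTrainer

hf_model_name = "tiiuae/falcon-7b-instruct"
dir_path = 'Tiiuae-falcon-7b-instruct'
model_name_is = "peft-training"
output_dir = f'{dir_path}/{model_name_is}'
logs_dir = f'{dir_path}/logs'
model_final_path = f"{output_dir}/final_model/"
EPOCHS = 3500  # passed to max_steps below, so this counts training steps, not epochs
LOGS = 1
SAVES = 700
EVALS = EPOCHS // 100  # integer division keeps eval_steps an int (35)
compute_dtype = torch.float16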
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=False,
)
model = AutoModelForCausalLM.from_pretrained(
        "tiiuae/falcon-7b-instruct",
        quantization_config=bnb_config,
        device_map={"": 0},
        trust_remote_code=False
)
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.05, # 0.1
    r=64,
    bias="lora_only", # none
    task_type="CAUSAL_LM",
    target_modules=[
        "query_key_value"
    ],
)
model.config.use_cache = False
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b-instruct", trust_remote_code=False)
tokenizer.pad_token = tokenizer.eos_token
training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    optim='paged_adamw_32bit',
    max_steps=EPOCHS,
    save_steps=SAVES,
    logging_steps=LOGS,
    logging_dir=logs_dir,
    eval_steps=EVALS,
    evaluation_strategy="steps",
    fp16=True,
    learning_rate=0.001,
    max_grad_norm=0.3,
    warmup_ratio=0.15, # 0.03
    lr_scheduler_type="constant",
    disable_tqdm=True,
)
model.config.use_cache = False
trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=448,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=True,
)
for name, module in trainer.model.named_modules():
    if "norm" in name:
        module = module.to(torch.float32)
train_result = trainer.train()

I then saved it like this:

metrics = train_result.metrics
max_train_samples = len(train_dataset)
metrics["train_samples"] = min(max_train_samples, len(train_dataset))
# save train results
trainer.log_metrics("train", metrics)
trainer.save_metrics("train", metrics)
# compute evaluation results
metrics = trainer.evaluate()
max_val_samples = len(eval_dataset)
metrics["eval_samples"] = min(max_val_samples, len(eval_dataset))
# save evaluation results
trainer.log_metrics("eval", metrics)
trainer.save_metrics("eval", metrics)

model.save_pretrained(model_final_path)

I've tried many different ways to load it, or to load it and save it again (for example, adding lora_model.merge_and_unload(), or simply calling local_model = AutoModelForCausalLM.from_pretrained(after_merge_model_path), and more), but every attempt ended in an error, sometimes the same one and sometimes a different one. I need your help here.

If you think it's better suited there, I also opened a question on the HuggingFace Forum.

Solution

Fine-tuning here works by training adapters on top of the base model. After training you save only the adapter, not the base model. So the workflow is as follows:

During training:

you download the base model from HF and save it in a cache directory
you train the PEFT adapter and save it

During inference:

you load the cached HF base model
you load the saved PEFT adapter and apply it to the base model
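To see why saving only the adapter is practical, here is a rough sketch of how many parameters LoRA adds for the configuration in the question. The Falcon-7B layer shapes used below (hidden size 4544, fused query_key_value output size 4672, 32 decoder layers) are assumptions that are not stated in the question, so treat the numbers as an estimate rather than an exact figure.

```python
def lora_param_count(in_features: int, out_features: int, r: int) -> int:
    """Parameters LoRA adds to one linear layer.

    LoRA learns two small matrices: A with shape (r, in_features)
    and B with shape (out_features, r).
    """
    return r * (in_features + out_features)


# Assumed Falcon-7B shapes (not taken from the question).
HIDDEN = 4544    # assumed hidden size
QKV_OUT = 4672   # assumed output size of the fused query_key_value layer
LAYERS = 32      # assumed number of decoder layers
R = 64           # r from the LoraConfig in the question

per_layer = lora_param_count(HIDDEN, QKV_OUT, R)
total = per_layer * LAYERS

print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,}")
```

Under these assumptions the adapter holds on the order of tens of millions of parameters, compared with roughly seven billion in the base model, which is why the adapter directory is small and the base model has to be loaded separately.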

Step 1. Download the HF model into a predefined cache directory:

import os
from pathlib import Path

# set the cache locations before importing transformers, so the library picks them up
os.environ['HF_HOME'] = '/content/assets/hf_cache/'
os.environ['HF_DATASETS_CACHE'] = '/content/assets/hf_datasets/'

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

dir_path = Path('/content')
adapter_final_path = dir_path / "output" / "final_adapter"
base_quantized_path = dir_path / "output" / "base_model_q"

hf_model_name = "tiiuae/falcon-7b-instruct"

# load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(hf_model_name, trust_remote_code=False)
tokenizer.pad_token = tokenizer.eos_token

# load the model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=False,
)


model = AutoModelForCausalLM.from_pretrained(
        hf_model_name,
        quantization_config=bnb_config,
        device_map={"": 0},
        trust_remote_code=False
)

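# save a local copy of the quantized base model and its tokenizer
# (saving 4-bit weights may need a recent transformers/bitsandbytes release)
model.save_pretrained(base_quantized_path)
tokenizer.save_pretrained(base_quantized_path)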
...

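After training, save the PEFT adapter:

# ... train the model ...

train_result = trainer.train()

# trainer.model is the PEFT-wrapped model when the trainer was given a peft_config,
# so this writes only the adapter weights and config
trainer.model.save_pretrained(adapter_final_path)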
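A quick way to confirm what a save directory actually contains is to look at the files in it. The sketch below relies on PEFT writing an adapter_config.json next to the adapter weights and on a full transformers model writing a config.json. The helper name describe_save_dir is made up for this example.

```python
from pathlib import Path


def describe_save_dir(path) -> str:
    """Classify a save directory as 'adapter', 'full_model', 'both',
    'unknown' or 'missing', based on the config files found in it."""
    p = Path(path)
    if not p.is_dir():
        return "missing"
    names = {f.name for f in p.iterdir()}
    has_adapter = "adapter_config.json" in names  # written by PEFT
    has_full = "config.json" in names             # written by a full model save
    if has_adapter and has_full:
        return "both"
    if has_adapter:
        return "adapter"
    if has_full:
        return "full_model"
    return "unknown"


# Path taken from the question; adjust to your own output directory.
print(describe_save_dir("Tiiuae-falcon-7b-instruct/peft-training/final_model"))
```

If the result is "adapter", the directory holds only the LoRA weights, so the base model still has to be loaded from somewhere else before the adapter can be applied.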

During inference, reload the base model and the PEFT adapter:


from transformers import AutoModelForCausalLM, AutoTokenizer

# load base model
model = AutoModelForCausalLM.from_pretrained(base_quantized_path)
tokenizer = AutoTokenizer.from_pretrained(base_quantized_path)

# apply saved adapter to the model (requires the peft package to be installed)
model.load_adapter(adapter_final_path)
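Since the question mentions merge_and_unload(), here is a hedged sketch of an alternative that goes through the peft package directly: PeftModel.from_pretrained attaches the saved adapter, and merge_and_unload() optionally folds the LoRA weights into the base model. The function name load_with_adapter is made up for this example, and it assumes the base model is loaded unquantized in float16 on GPU 0, which needs considerably more GPU memory than the 4-bit setup above.

```python
def load_with_adapter(base_name_or_path: str, adapter_path: str, merge: bool = False):
    """Load a base model, attach a saved LoRA adapter, and optionally merge it."""
    # Imported here so the heavy dependencies are only needed when the function runs.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM

    # Assumption: load the base model in float16 without quantization.
    base = AutoModelForCausalLM.from_pretrained(
        base_name_or_path,
        torch_dtype=torch.float16,
        device_map={"": 0},
    )
    # Attach the adapter weights saved after training.
    model = PeftModel.from_pretrained(base, adapter_path)
    if merge:
        # Fold the LoRA weights into the base weights and drop the PEFT wrapper.
        model = model.merge_and_unload()
    return model
```

With merge=True the returned object is a plain transformers model, so calling save_pretrained on it should write a full standalone model directory rather than an adapter. For example, load_with_adapter("tiiuae/falcon-7b-instruct", str(adapter_final_path), merge=True) would use the names from the code above.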