I am trying to fine-tune the Llama2 model on the alpaca dataset. I have loaded the model in 4-bit and applied a PEFT config to it for LoRA training (a rough sketch of my setup is shown below, after the dataset). Then I am trying to use TRL's SFTTrainer to fine-tune the model.
The train_dataset is
Dataset({
    features: ['instruction', 'input', 'output', 'input_ids', 'attention_mask'],
    num_rows: 50002
})
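
For reference, my model and LoRA setup looks roughly like this (the exact model name, quantization settings, and LoRA hyperparameters below are placeholders, not necessarily the ones I used):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "meta-llama/Llama-2-7b-hf"

# 4-bit quantization via bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# LoRA adapters applied on top of the 4-bit base model
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, peft_config)
```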
This is the error that I get:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[28], line 3
      1 # Step 8 :Set supervised fine-tuning parameters
      2 from transformers import DataCollatorForLanguageModeling
----> 3 trainer = SFTTrainer(
      4     model=model,
      5     train_dataset=train_data,
      6     #eval_dataset=val_data,
      7     #peft_config=peft_config,
      8     #dataset_text_field="train",
      9     max_seq_length=max_seq_length,
     10     tokenizer=tokenizer,
     11     args=training_arguments,
     12     #packing=True,
     13     #data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
     14 )
ValueError: You passed `packing=False` to the SFTTrainer, but you didn't pass a `dataset_text_field` or `formatting_func` argument.
If I try to pass `packing=True`, then I get this error:
ValueError: You need to pass a `dataset_text_field` or `formatting_func` argument to the SFTTrainer if you want to use the `ConstantLengthDataset`.
If I provide the `dataset_text_field` argument (I do not know what it is supposed to be, so I tried the "train" and "text" keywords), I get this error:
ValueError: the `--group_by_length` option is only available for `Dataset`, not `IterableDataset`
I would appreciate it if someone could help me understand `dataset_text_field`, and where the `ConstantLengthDataset` is set (does it come from `packing`?). I also tried `packing=False` while providing `dataset_text_field` as 'train' and 'text', and those are incorrect as well.
Based on the documentation:
dataset_text_field (Optional[str]): The name of the text field of the dataset, in case this is passed by a user, the trainer will automatically create a ConstantLengthDataset based on the dataset_text_field argument.
`dataset_text_field` (`Optional[str]`) is the name of the field in the training dataset that contains the text that will be used for training, only if `formatting_func` is `None`.
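In other words, if a `formatting_func` is passed, it takes precedence and `dataset_text_field` is not needed. For completeness, a minimal sketch of that alternative route, assuming the alpaca-style `instruction`/`input`/`output` columns from your dataset (the prompt template is just an illustration):

```python
def formatting_func(examples):
    # Called in batches: build one training string per row from the alpaca columns.
    # The Llama2 [INST] markers below are an assumption about the prompt format you want.
    output_texts = []
    for i in range(len(examples["instruction"])):
        prompt = examples["instruction"][i]
        if examples["input"][i]:
            prompt += "\n" + examples["input"][i]
        output_texts.append(f"<s>[INST] {prompt} [/INST] {examples['output'][i]} </s>")
    return output_texts

trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    formatting_func=formatting_func,  # no dataset_text_field needed in this case
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
)
```

If you go with `dataset_text_field` instead, the column you point it at matters.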
You should be careful, because if you do this:

`dataset_text_field='instruction'`

SFTTrainer will only read the text saved in `train_data['instruction']`, so Llama2 will only learn to predict the instructions, without the answers.
Each row of your dataset (`train_data`) should have a string saved in a 'text' field, like this:

`train_data[0]['text'] = "<s>[INST]<<SYS>>You are an expert in math.<</SYS>>Compute 2+2 [/INST]4 </s>"`
And you should do this:
`dataset_text_field='text'`
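
Concretely, you can build that 'text' column from your alpaca columns with `Dataset.map` and then point `dataset_text_field` at it (the Llama2 `[INST]` markers below are an assumption about how you want to format the examples):

```python
def build_text(example):
    # Merge the alpaca-style columns into a single training string per row.
    prompt = example["instruction"]
    if example["input"]:
        prompt += "\n" + example["input"]
    example["text"] = f"<s>[INST] {prompt} [/INST] {example['output']} </s>"
    return example

train_data = train_data.map(build_text)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    dataset_text_field="text",      # SFTTrainer now knows which column holds the text
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    # packing left at the default (False); the trainer tokenizes the 'text' column itself
)
trainer.train()
```

As for the `ConstantLengthDataset` question: as far as I know, SFTTrainer builds one internally when you set `packing=True`, and it is an `IterableDataset`, which is why `group_by_length` in your TrainingArguments then complains. So either keep `packing=False` or drop `group_by_length` if you want packing.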