python, large-language-model, llama

TRL SFTTrainer - llama2 finetuning on Alpaca dataset - dataset_text_field


I am trying to finetune the Llama2 model on the Alpaca dataset. I have loaded the model in 4-bit and applied the PEFT config to the model for LoRA training. Now I am trying to use TRL's SFTTrainer to fine-tune the model.

The train_dataset is

Dataset({
    features: ['instruction', 'input', 'output', 'input_ids', 'attention_mask'],
    num_rows: 50002
})

This is the error that I get:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[28], line 3
      1 # Step 8 :Set supervised fine-tuning parameters
      2 from transformers import DataCollatorForLanguageModeling
----> 3 trainer = SFTTrainer(
      4     model=model,
      5     train_dataset=train_data,
      6     #eval_dataset=val_data,
      7     #peft_config=peft_config,
      8     #dataset_text_field="train",
      9     max_seq_length=max_seq_length,
     10     tokenizer=tokenizer,
     11     args=training_arguments,
     12     #packing=True,
     13     #data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
     14 )


ValueError: You passed `packing=False` to the SFTTrainer, but you didn't pass a `dataset_text_field` or `formatting_func` argument.

If I try to pass the packing = True then I get this error:

ValueError: You need to pass a `dataset_text_field` or `formatting_func` argument to the SFTTrainer if you want to use the `ConstantLengthDataset`.

I do not know what dataset_text_field should be. I tried providing it with the keywords "train" and "text", and I am getting this error:

ValueError: the `--group_by_length` option is only available for `Dataset`, not `IterableDataset`

I would appreciate it if someone could help me understand dataset_text_field, and where ConstantLengthDataset comes in (does it come from packing?). I also tried packing=False with dataset_text_field set to 'train' and 'text', and both are incorrect.

Based on the documentation:

dataset_text_field (Optional[str]): The name of the text field of the dataset, in case this is passed by a user, the trainer will automatically create a ConstantLengthDataset based on the dataset_text_field argument.

Solution

  • dataset_text_field (Optional[str]) is the name of the field in the training dataset that contains the text that will be used for training only if formatting_func is None.

    You should be careful because if you do this:

    dataset_text_field='instruction'

    SFTTrainer will only read the text saved in train_dataset['instruction'], so Llama2 will only learn to predict the instructions without the answers.

    Each row of your dataset (train_data) should have a string saved in the 'text' field, like this:

    train_data[0]['text']="<s>[INST]<<SYS>>You are an expert in math.<</SYS>>Compute 2+2 [/INST]4 </s>"
    

    And you should do this:

    dataset_text_field='text'
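
    Putting it together, a minimal sketch (assuming the Alpaca-style fields from your dataset: 'instruction', 'input', 'output'; the prompt template shown is one common Llama-2 formatting, not the only valid one) that builds the 'text' column and then points dataset_text_field at it:

    ```python
    def to_llama2_prompt(example):
        """Wrap one Alpaca row into a single training string in the 'text' field."""
        user = example["instruction"]
        # Alpaca rows may carry an optional 'input' with extra context
        if example.get("input"):
            user += "\n" + example["input"]
        return {"text": f"<s>[INST] {user} [/INST] {example['output']} </s>"}

    # Applied with datasets.Dataset.map, then passed to SFTTrainer:
    #
    # train_data = train_data.map(to_llama2_prompt)
    #
    # trainer = SFTTrainer(
    #     model=model,
    #     train_dataset=train_data,
    #     dataset_text_field="text",   # now names a column that actually exists
    #     max_seq_length=max_seq_length,
    #     tokenizer=tokenizer,
    #     args=training_arguments,
    # )
    ```

    Alternatively, you can skip creating the column and pass a formatting_func to SFTTrainer that returns these strings; dataset_text_field and formatting_func are the two ways to tell the trainer where the training text comes from.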