I'm using the SFTTrainer from the 'trl' package to fine-tune a language model. I would like to give it some negative examples, but I'm not seeing any built-in methods anywhere. Is there something I'm missing, or some way of implementing this in a custom way?
I tried looking in the documentation but haven't seen anything obvious.
SFTTrainer is designed for supervised fine-tuning (maximizing likelihood of in-distribution samples), so there is no straightforward way to utilize negative samples.
Other alignment algorithms such as KTO (also implemented in trl) might do the job in your case, since KTO trains on individual completions labeled as desirable or undesirable.
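As a rough sketch of what the data preparation for KTO could look like: trl's KTO trainer works with unpaired examples that carry a prompt, a completion, and a boolean label. The raw field names below (`question`, `answer`, `is_good`) and the helper `to_kto_records` are made up for illustration, so adapt them to your own data.

```python
# Hypothetical raw data: "question", "answer" and "is_good" are
# illustrative field names, not something trl requires.
raw_examples = [
    {"question": "What is 2 + 2?", "answer": "4", "is_good": True},
    {"question": "What is 2 + 2?", "answer": "5", "is_good": False},
]


def to_kto_records(examples):
    """Map raw examples to prompt/completion/label records.

    label=True marks a desirable completion, label=False a negative one.
    """
    return [
        {
            "prompt": ex["question"],
            "completion": ex["answer"],
            "label": bool(ex["is_good"]),
        }
        for ex in examples
    ]


kto_records = to_kto_records(raw_examples)
```

From there you could wrap the list with `datasets.Dataset.from_list(kto_records)` and pass it as the training dataset to `KTOTrainer`. The trainer's constructor arguments have changed between trl releases, so check the docs for the version you have installed.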
Another possible way is to modify the prompt so that it includes the negative label. For example: "{question} This is the wrong answer: {answer}".
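A minimal sketch of that prompt-labeling idea, if you want to stay with SFTTrainer. The negative template is the one from above; the positive template, the `format_example` helper, and the choice of a single `"text"` field are assumptions for illustration, not anything trl prescribes.

```python
# Template for negative examples, taken from the suggestion above.
NEGATIVE_TEMPLATE = "{question} This is the wrong answer: {answer}"
# Assumed counterpart for positive examples (not from the original answer).
POSITIVE_TEMPLATE = "{question} This is the correct answer: {answer}"


def format_example(question, answer, is_good):
    """Return one training record with the label spelled out in the text."""
    template = POSITIVE_TEMPLATE if is_good else NEGATIVE_TEMPLATE
    return {"text": template.format(question=question, answer=answer)}


records = [
    format_example("What is 2 + 2?", "4", is_good=True),
    format_example("What is 2 + 2?", "5", is_good=False),
]
```

The model still maximizes the likelihood of the wrong answer here, just conditioned on the "wrong answer" label, so at inference time you would prompt with the "correct answer" wording. Whether this actually steers the model away from bad answers is something you would have to test on your own data.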