In the process of doing RLHF, I trained a reward model on a dataset of chosen and rejected string pairs. It is very similar to the Reward Modeling example in the official TRL library.
I used the LLaMA 2 7B model (I tried both the chat and non-chat versions - the behavior is the same). What I would like to do now is pass an input to the trained reward model and look at its output. However, I can't seem to make any sense of what the reward model outputs.
For example: I tried to make the input as follows -
chosen = "This is the chosen text."
rejected = "This is the rejected text."
test = {"chosen": chosen, "rejected": rejected}
Then I try -
import torch
import torch.nn as nn
from transformers import AutoModelForSequenceClassification, AutoTokenizer, AutoModelForCausalLM
base_model_id = "./llama2models/Llama-2-7b-chat-hf"
model_id = "./reward_models/Llama-2-7b-chat-hf_rm_inference/checkpoint-500"
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    # num_labels=1,  # gives an error since the model always outputs a tensor of [2, 4096]
)
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
rewards_chosen = model(
    **tokenizer(chosen, return_tensors='pt')
).logits
print('reward chosen is ', rewards_chosen)
rewards_rejected = model(
    **tokenizer(rejected, return_tensors='pt')
).logits
print('reward rejected is ', rewards_rejected)
loss = -nn.functional.logsigmoid(rewards_chosen - rewards_rejected).mean()
print(loss)
And the output looks something like this -
reward chosen is tensor([[ 2.1758, -8.8359]], dtype=torch.float16)
reward rejected is tensor([[ 1.0908, -2.2168]], dtype=torch.float16)
tensor(0.0044)
Printing the loss wasn't helpful. I do not see any trend (for example, a positive loss turning negative) even if I switch rewards_chosen and rewards_rejected in the formula.
The outputs themselves did not yield any insights either. I do not understand how to make sense of rewards_chosen and rewards_rejected. Why is each of them a tensor with two elements instead of one?
I tried rewards_chosen > rewards_rejected, but that is also not helpful, since it outputs tensor([[ True, False]]).
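I am guessing the two values have something to do with the num_labels argument (the one I had to comment out above), since the width of the logits should match the size of the classification head. A quick probe, reusing the model and tokenizer loaded above:

# Reusing model, tokenizer and chosen from the snippet above.
# The second dimension of the logits should equal the number of labels on the classification head.
print(model.config.num_labels)  # I expect this to print 2 for my checkpoint, hence two logits per input
logits = model(**tokenizer(chosen, return_tensors='pt')).logits
print(logits.shape)             # torch.Size([1, 2]) - one row per input, num_labels columns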
When I try a public reward model (it's just a few megabytes since it's only the adapter - https://huggingface.co/vincentmin/llama-2-13b-reward-oasst1), I get outputs that make more sense, since it outputs a single-element tensor -
Code -
import torch
import torch.nn as nn
from peft import PeftModel, PeftConfig
from transformers import AutoModelForSequenceClassification, AutoTokenizer, AutoModelForCausalLM
peft_model_id = "./llama-2-13b-reward-oasst1"
base_model_id = "/cluster/work/lawecon/Work/raj/llama2models/13b-chat-hf"
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    base_model_id,
    num_labels=1,
    # torch_dtype=torch.float16,
)
model = PeftModel.from_pretrained(model, peft_model_id)
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
chosen = "prompter: What is your purpose? assistant: My purpose is to assist you."
rejected = "prompter: What is your purpose? assistant: I do not understand you."
test = {"chosen": chosen, "rejected": rejected}
model.eval()
with torch.no_grad():
    rewards_chosen = model(
        **tokenizer(chosen, return_tensors='pt')
    ).logits
    print('reward chosen is ', rewards_chosen)
    rewards_rejected = model(
        **tokenizer(rejected, return_tensors='pt')
    ).logits
    print('reward rejected is ', rewards_rejected)
    loss = -nn.functional.logsigmoid(rewards_chosen - rewards_rejected).mean()
    print(loss)
Output -
reward chosen is tensor([[0.6876]])
reward rejected is tensor([[-0.9243]])
tensor(0.1819)
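For what it's worth, the printed loss is consistent with these two rewards, so the arithmetic checks out:

import torch
import torch.nn.functional as F
# -logsigmoid(reward_chosen - reward_rejected) with the values printed above
print(-F.logsigmoid(torch.tensor(0.6876 - (-0.9243))))  # tensor(0.1819)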
This output makes more sense to me. But why does my own reward model output two values?
I've been facing the exact same issue myself, after following the same example from the TRL library! I think there's a mistake in that example; reward models should output single-element tensors, as you suggest, rather than two-element tensors.
I believe that setting num_labels=1 when calling AutoModelForSequenceClassification.from_pretrained is the solution here. This instantiates a model with a single-element output.
I can see that you've commented this out in your example, saying that it "gives an error since the model always outputs a tensor of [2, 4096]". I get no such error, so I'm not sure what's going on for you there.
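In case it helps, here is a minimal sketch of what I mean, reusing the base-model path from your first snippet. The important bit is that I pass num_labels=1 when the reward model is first created, i.e. before reward training, so that every checkpoint saved afterwards keeps the single-column head:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

base_model_id = "./llama2models/Llama-2-7b-chat-hf"

# Create the reward model with a single-output classification head.
# (The head is freshly initialized here, so the score only becomes meaningful after reward training.)
model = AutoModelForSequenceClassification.from_pretrained(
    base_model_id,
    num_labels=1,
)
tokenizer = AutoTokenizer.from_pretrained(base_model_id)

model.eval()
with torch.no_grad():
    reward = model(**tokenizer("This is the chosen text.", return_tensors='pt')).logits
print(reward.shape)  # torch.Size([1, 1]) - a single scalar reward per input sequence

With a single reward per sequence, rewards_chosen - rewards_rejected is one number, and swapping the two arguments makes the -logsigmoid loss larger rather than smaller (it never goes negative, since -logsigmoid is always positive), which is the kind of trend you were looking for.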