Tags: python, huggingface-transformers, llama, reward

How to make sense of the output of the reward model, how do we know what string it is preferring?


In the process of doing RLHF, I trained a reward model using a dataset of chosen and rejected string pairs. It is very similar to the Reward Modeling example in the official TRL library.

I used the LLaMA 2 7B model (I tried both the chat and non-chat versions; the behavior is the same). What I would like to do now is pass an input to the reward model and look at its output. However, I can't seem to make any sense of what the reward model outputs.

For example: I tried to make the input as follows -

chosen = "This is the chosen text."
rejected = "This is the rejected text."
test = {"chosen": chosen, "rejected": rejected}

Then I try -

import torch
import torch.nn as nn

from transformers import AutoModelForSequenceClassification, AutoTokenizer, AutoModelForCausalLM

base_model_id = "./llama2models/Llama-2-7b-chat-hf"
model_id = "./reward_models/Llama-2-7b-chat-hf_rm_inference/checkpoint-500"

model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    # num_labels=1,  # gives an error since the model always outputs a tensor of [2, 4096]
)

tokenizer = AutoTokenizer.from_pretrained(base_model_id)

rewards_chosen = model(**tokenizer(chosen, return_tensors='pt')).logits
print('reward chosen is ', rewards_chosen)

rewards_rejected = model(**tokenizer(rejected, return_tensors='pt')).logits
print('reward rejected is ', rewards_rejected)

loss = -nn.functional.logsigmoid(rewards_chosen - rewards_rejected).mean()
print(loss)

And the output looks something like this -

reward chosen is  tensor([[ 2.1758, -8.8359]], dtype=torch.float16)
reward rejected is  tensor([[ 1.0908, -2.2168]], dtype=torch.float16)
tensor(0.0044)

Printing the loss wasn't helpful. I do not see any trend (for example, a positive loss turning negative) even if I swap rewards_chosen and rewards_rejected in the formula.

The outputs themselves did not yield any insights either. I do not understand how to make sense of rewards_chosen and rewards_rejected. Why are they tensors with two elements instead of one?

I tried rewards_chosen > rewards_rejected, but that is not helpful either, since it outputs tensor([[ True, False]]).

When I try a public reward model (it's just a few megabytes since it's only the adapter - https://huggingface.co/vincentmin/llama-2-13b-reward-oasst1), I get outputs that make more sense, since it outputs a single-element tensor -

Code -

import torch
import torch.nn as nn

from peft import PeftModel, PeftConfig
from transformers import AutoModelForSequenceClassification, AutoTokenizer, AutoModelForCausalLM

peft_model_id = "./llama-2-13b-reward-oasst1"
base_model_id = "/cluster/work/lawecon/Work/raj/llama2models/13b-chat-hf"

config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    base_model_id,
    num_labels=1,
    # torch_dtype=torch.float16,
)
model = PeftModel.from_pretrained(model, peft_model_id)
tokenizer = AutoTokenizer.from_pretrained(base_model_id)

chosen = "prompter: What is your purpose? assistant: My purpose is to assist you."
rejected = "prompter: What is your purpose? assistant: I do not understand you."
test = {"chosen": chosen, "rejected": rejected}

model.eval()
with torch.no_grad():
    rewards_chosen = model(**tokenizer(chosen, return_tensors='pt')).logits
    print('reward chosen is ', rewards_chosen)

    rewards_rejected = model(**tokenizer(rejected, return_tensors='pt')).logits
    print('reward rejected is ', rewards_rejected)

    loss = -nn.functional.logsigmoid(rewards_chosen - rewards_rejected).mean()
    print(loss)

Output -

reward chosen is  tensor([[0.6876]])
reward rejected is  tensor([[-0.9243]])
tensor(0.1819)

This output makes more sense to me. But why does my own reward model produce outputs with two values?


Solution

  • I've been facing the exact same issue myself, after following the same example in the TRL library! I think there's a mistake in that example; reward models should output single-element tensors, as you suggest, rather than two-element tensors.

    I believe that setting num_labels=1 when calling AutoModelForSequenceClassification.from_pretrained is the solution here. The default is num_labels=2, which instantiates a two-class classification head and is why you are seeing two logits per input; with num_labels=1 the model has a single-element output, which is what a reward model should produce (see the sketch after this answer).

    I can see that you've commented this out in your example, saying that it "gives an error since the model always outputs a tensor of [2, 4096]". I get no such error, so I'm not sure what's going on for you there.
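
    As a minimal sketch (assuming the reward model has been retrained with num_labels=1 so the saved classification head really is a single scalar, and reusing the checkpoint paths from the question), scoring then looks like this; after training, chosen should generally receive a higher scalar reward than rejected:

    import torch
    import torch.nn as nn
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    base_model_id = "./llama2models/Llama-2-7b-chat-hf"  # tokenizer source (path from the question)
    model_id = "./reward_models/Llama-2-7b-chat-hf_rm_inference/checkpoint-500"  # reward-model checkpoint

    tokenizer = AutoTokenizer.from_pretrained(base_model_id)
    # num_labels=1 gives a head with a single logit per sequence, i.e. a scalar reward
    model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=1)
    model.eval()

    def reward(text: str) -> float:
        """Return the scalar reward the model assigns to `text`."""
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            return model(**inputs).logits.squeeze().item()

    r_chosen = reward("This is the chosen text.")
    r_rejected = reward("This is the rejected text.")
    print('reward chosen is ', r_chosen)
    print('reward rejected is ', r_rejected)

    # With scalar rewards, the pairwise loss from the question behaves as expected:
    # it is small when r_chosen is well above r_rejected, and grows if you swap them.
    loss = -nn.functional.logsigmoid(torch.tensor(r_chosen - r_rejected))
    print(loss.item())

    With scalar outputs, rewards_chosen > rewards_rejected also reduces to a single True/False answer to "which string does the model prefer?", which is the comparison you were after.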