I'm new to NLP (pardon the very noob question!), and am looking for a way to perform vector operations on sentence embeddings (e.g., randomization in embedding-space in a uniform ball around a given sentence) and then decode them. I'm currently attempting to use the following strategy with T5 and Huggingface Transformers:
1. Tokenize the sentence with T5Tokenizer.
2. Encode it with model.encoder and use the last hidden state as the embedding. (I've tried .generate as well, but it doesn't allow me to use the decoder separately from the encoder.)
3. Modify the embedding in vector space (e.g., random perturbation).
4. Decode with model.decoder and decode the resulting tokens with the tokenizer.
I'm having trouble with (4). My sanity check: I set (3) to do nothing (no change to the embedding), and I check whether the resulting text is the same as the input. So far, that check always fails.
I get the sense that I'm missing something rather important (perhaps the lack of beam search or some other generation method?). I'm also unsure whether what I think is an embedding (as in (2)) is even correct.
How would I go about encoding a sentence into an embedding with T5, modifying it in that vector space, and then decoding it into generated text? Also, might another model be a better fit?
As a sample, below is my incredibly broken code, based on this:
import transformers

t5_model = transformers.T5ForConditionalGeneration.from_pretrained("t5-large")
t5_tok = transformers.T5Tokenizer.from_pretrained("t5-large")
text = "Foo bar is typing some words."
input_ids = t5_tok(text, return_tensors="pt").input_ids
encoder_output_vectors = t5_model.encoder(input_ids, return_dict=True).last_hidden_state
# The rest is what I think is problematic:
decoder_input_ids = t5_tok("<pad>", return_tensors="pt", add_special_tokens=False).input_ids
decoder_output = t5_model.decoder(decoder_input_ids, encoder_hidden_states=encoder_output_vectors)
t5_tok.decode(decoder_output.last_hidden_state[0].softmax(0).argmax(1))
Much easier than anticipated! For anyone else looking for an answer, this page in HuggingFace's docs wound up helping me the most. Below is an example with code based heavily on that page.
First, to get the hidden layer embeddings:
# Re-using the model and tokenizer loaded above; the input string and beam count are example values
model, tokenizer = t5_model, t5_tok
encoder_input_str = "Foo bar is typing some words."
num_beams = 3

encoder_input_ids = tokenizer(encoder_input_str, return_tensors="pt").input_ids
encoder_outputs = model.get_encoder()(
    encoder_input_ids.repeat_interleave(num_beams, dim=0),
    return_dict=True,
)
Note that repeat_interleave above is only needed for decoding methods such as beam search, which expect one copy of the encoder output per beam; for single-sequence methods such as greedy search, no repetition of the hidden-layer embedding is necessary.
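To tie this back to step (3) of the original question (modifying the embedding before decoding), one option is to perturb encoder_outputs.last_hidden_state in place before running the decoder. Below is a minimal sketch of uniform-ball noise; radius is an illustrative value I picked, not anything prescribed by the docs:
import torch

radius = 0.1  # illustrative perturbation radius (an assumption, tune as needed)

hidden = encoder_outputs.last_hidden_state  # shape: (num_beams, seq_len, d_model)
# One random direction per token position, shared across the repeated beam copies
direction = torch.randn(1, hidden.size(1), hidden.size(2), device=hidden.device, dtype=hidden.dtype)
direction = direction / direction.norm(dim=-1, keepdim=True)
# Uniform-in-ball radius: r = R * U**(1/d) for a d-dimensional ball
r = radius * torch.rand(1, hidden.size(1), 1, device=hidden.device, dtype=hidden.dtype) ** (1.0 / hidden.size(-1))
encoder_outputs.last_hidden_state = hidden + r * direction
With this in place, the decoding below operates on the perturbed embedding rather than the original one.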
HuggingFace provides many decoding methods, matching the ones accessible through generate()'s options; these are documented in the page linked above. As an example, here is decoding with beam search using num_beams beams:
import torch
from transformers import (
    BeamSearchScorer,
    LogitsProcessorList,
    MinLengthLogitsProcessor,
    StoppingCriteriaList,
    MaxLengthCriteria,
)

model_kwargs = {
    "encoder_outputs": encoder_outputs
}
# Define decoder start token ids
input_ids = torch.ones((num_beams, 1), device=model.device, dtype=torch.long)
input_ids = input_ids * model.config.decoder_start_token_id
# Instantiate the beam scorer, logits processors, and stopping criteria
beam_scorer = BeamSearchScorer(
batch_size=1,
num_beams=num_beams,
device=model.device,
)
logits_processor = LogitsProcessorList(
[
MinLengthLogitsProcessor(5, eos_token_id=model.config.eos_token_id),
]
)
max_length = 20  # example value for the maximum generated length
stopping_criteria = StoppingCriteriaList([
    MaxLengthCriteria(max_length=max_length),
])
outputs = model.beam_search(input_ids, beam_scorer, logits_processor=logits_processor, stopping_criteria=stopping_criteria, **model_kwargs)
results = tokenizer.batch_decode(outputs, skip_special_tokens=True)
Similar approaches can be taken for greedy and contrastive search, with different parameters, and different stopping criteria can be swapped in.
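For instance, a minimal greedy-search sketch (assuming the same transformers version that exposes beam_search also exposes greedy_search; the greedy_-prefixed names are just illustrative) could look like this:
# Greedy search decodes a single sequence, so the encoder output is not repeated
greedy_encoder_outputs = model.get_encoder()(encoder_input_ids, return_dict=True)

greedy_input_ids = torch.ones((1, 1), device=model.device, dtype=torch.long)
greedy_input_ids = greedy_input_ids * model.config.decoder_start_token_id

greedy_outputs = model.greedy_search(
    greedy_input_ids,
    logits_processor=logits_processor,
    stopping_criteria=stopping_criteria,
    encoder_outputs=greedy_encoder_outputs,
)
print(tokenizer.batch_decode(greedy_outputs, skip_special_tokens=True))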