I'm new to NLP (pardon the very noob question!), and am looking for a way to perform vector operations on sentence embeddings (e.g., randomization in embedding-space in a uniform ball around a given sentence) and then decode them. I'm currently attempting to use the following strategy with T5 and Huggingface Transformers:
1. Tokenize the sentence with T5Tokenizer.
2. Encode it with model.encoder and use the last hidden state as the embedding. (I've tried .generate as well, but it doesn't allow me to use the decoder separately from the encoder.)
3. Modify the embedding in vector space (e.g., random perturbation).
4. Decode with model.decoder and decode the resulting tokens with the tokenizer.
I'm having trouble with (4). My sanity check: I set (3) to do nothing (no change to the embedding), and I check whether the resulting text is the same as the input. So far, that check always fails.
I get the sense that I'm missing something rather important (perhaps the lack of beam search or some other generation method?). I'm also unsure whether what I think is an embedding (as in (2)) is even correct.
How would I go about encoding a sentence into an embedding with T5, modifying it in that vector space, and then decoding it into generated text? Also, might another model be a better fit?
As a sample, below is my incredibly broken code, based on this:
import transformers

t5_model = transformers.T5ForConditionalGeneration.from_pretrained("t5-large")
t5_tok = transformers.T5Tokenizer.from_pretrained("t5-large")
text = "Foo bar is typing some words."
input_ids = t5_tok(text, return_tensors="pt").input_ids
encoder_output_vectors = t5_model.encoder(input_ids, return_dict=True).last_hidden_state
# The rest is what I think is problematic:
decoder_input_ids = t5_tok("<pad>", return_tensors="pt", add_special_tokens=False).input_ids
decoder_output = t5_model.decoder(decoder_input_ids, encoder_hidden_states=encoder_output_vectors)
t5_tok.decode(decoder_output.last_hidden_state[0].softmax(0).argmax(1))
Much easier than anticipated! For anyone else looking for an answer, this page in HuggingFace's docs wound up helping me the most. Below is an example with code based heavily on that page.
First, to get the hidden layer embeddings:
# Re-using the model and tokenizer loaded above; the input string and beam count are example values
model, tokenizer = t5_model, t5_tok
encoder_input_str = "Foo bar is typing some words."
num_beams = 3

encoder_input_ids = tokenizer(encoder_input_str, return_tensors="pt").input_ids
encoder_outputs = model.get_encoder()(
    encoder_input_ids.repeat_interleave(num_beams, dim=0),
    return_dict=True,
)
Note that repeat_interleave above is only needed for decoding methods such as beam search, which expect one copy of the encoder output per beam; for single-sequence methods such as greedy search, no repetition of the hidden-layer embedding is necessary.
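To tie this back to step (3) of the original question (modifying the embedding before decoding), one option is to perturb encoder_outputs.last_hidden_state in place before running the decoder. Below is a minimal sketch of uniform-ball noise; radius is an illustrative value I picked, not anything prescribed by the docs:
import torch

radius = 0.1  # illustrative perturbation radius (an assumption, tune as needed)

hidden = encoder_outputs.last_hidden_state  # shape: (num_beams, seq_len, d_model)
# One random direction per token position, shared across the repeated beam copies
direction = torch.randn(1, hidden.size(1), hidden.size(2), device=hidden.device, dtype=hidden.dtype)
direction = direction / direction.norm(dim=-1, keepdim=True)
# Uniform-in-ball radius: r = R * U**(1/d) for a d-dimensional ball
r = radius * torch.rand(1, hidden.size(1), 1, device=hidden.device, dtype=hidden.dtype) ** (1.0 / hidden.size(-1))
encoder_outputs.last_hidden_state = hidden + r * direction
With this in place, the decoding below operates on the perturbed embedding rather than the original one.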
HuggingFace provides many decoding methods, matching the ones accessible through generate()'s options; these are documented in the page linked above. As an example, here is decoding with beam search using num_beams beams:
import torch
from transformers import (
    BeamSearchScorer,
    LogitsProcessorList,
    MinLengthLogitsProcessor,
    StoppingCriteriaList,
    MaxLengthCriteria,
)

model_kwargs = {
    "encoder_outputs": encoder_outputs
}
# Define decoder start token ids
input_ids = torch.ones((num_beams, 1), device=model.device, dtype=torch.long)
input_ids = input_ids * model.config.decoder_start_token_id
# Instantiate the beam scorer, logits processors, and stopping criteria
beam_scorer = BeamSearchScorer(
batch_size=1,
num_beams=num_beams,
device=model.device,
)
logits_processor = LogitsProcessorList(
[
MinLengthLogitsProcessor(5, eos_token_id=model.config.eos_token_id),
]
)
max_length = 20  # example value for the maximum generated length
stopping_criteria = StoppingCriteriaList([
    MaxLengthCriteria(max_length=max_length),
])
outputs = model.beam_search(input_ids, beam_scorer, logits_processor=logits_processor, stopping_criteria=stopping_criteria, **model_kwargs)
results = tokenizer.batch_decode(outputs, skip_special_tokens=True)
Similar approaches can be taken for greedy and contrastive search, with different parameters, and different stopping criteria can be swapped in.
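For instance, a minimal greedy-search sketch (assuming the same transformers version that exposes beam_search also exposes greedy_search; the greedy_-prefixed names are just illustrative) could look like this:
# Greedy search decodes a single sequence, so the encoder output is not repeated
greedy_encoder_outputs = model.get_encoder()(encoder_input_ids, return_dict=True)

greedy_input_ids = torch.ones((1, 1), device=model.device, dtype=torch.long)
greedy_input_ids = greedy_input_ids * model.config.decoder_start_token_id

greedy_outputs = model.greedy_search(
    greedy_input_ids,
    logits_processor=logits_processor,
    stopping_criteria=stopping_criteria,
    encoder_outputs=greedy_encoder_outputs,
)
print(tokenizer.batch_decode(greedy_outputs, skip_special_tokens=True))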