
What is the function of the `text_target` parameter in Huggingface's `AutoTokenizer`?


I'm following the guide here: https://huggingface.co/docs/transformers/v4.28.1/tasks/summarization

There is one line in the guide like this:

labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True)

I don't understand the function of the text_target parameter.

I tried the following code and the last two lines gave exactly the same results.

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('t5-small')
text = "Weiter Verhandlung in Syrien."
tokenizer(text_target=text, max_length=128, truncation=True)
tokenizer(text, max_length=128, truncation=True)

The docs just say: "text_target (str, List[str], List[List[str]], optional): The sequence or batch of sequences to be encoded as target texts." I don't really understand that. Are there situations where setting text_target gives a different result?


Solution

  • Sometimes it is necessary to look at the source code (the snippet below is from PreTrainedTokenizerBase.__call__):

    if text is None and text_target is None:
        raise ValueError("You need to specify either `text` or `text_target`.")
    if text is not None:
        # The context manager will send the inputs as normal texts and not text_target, but we shouldn't change the
        # input mode in this case.
        if not self._in_target_context_manager:
            self._switch_to_input_mode()
        encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
    if text_target is not None:
        self._switch_to_target_mode()
        target_encodings = self._call_one(text=text_target, text_pair=text_pair_target, **all_kwargs)
    # Leave back tokenizer in input mode
    self._switch_to_input_mode()
    
    if text_target is None:
        return encodings
    elif text is None:
        return target_encodings
    else:
        encodings["labels"] = target_encodings["input_ids"]
        return encodings
    

    As you can see in the above snippet, both text and text_target are passed to self._call_one() for encoding (note that text_target is passed as the text parameter). That means encoding the same string as text or as text_target will give identical results as long as _switch_to_target_mode() doesn't do anything special. For t5-small it doesn't: the base class implements both mode switches as no-ops and the T5 tokenizer doesn't override them, which is why your two calls returned the same output.
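
    There are tokenizers where target mode does matter, though. For example, mBART's tokenizer overrides _switch_to_target_mode() to emit the target-language code instead of the source-language one. A minimal sketch (assuming the facebook/mbart-large-en-ro checkpoint is available):

    from transformers import AutoTokenizer

    # mBART appends a language code to every sequence, so input mode and
    # target mode produce different special tokens.
    tokenizer = AutoTokenizer.from_pretrained(
        "facebook/mbart-large-en-ro", src_lang="en_XX", tgt_lang="ro_RO"
    )

    text = "Further negotiations in Syria."
    as_input = tokenizer(text)["input_ids"]               # ... </s> en_XX
    as_target = tokenizer(text_target=text)["input_ids"]  # ... </s> ro_RO

    print(as_input == as_target)  # False: the trailing language code differs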

    The conditions at the end of the function answer your question:

    1. When you only provide text, you will retrieve its encoding.
    2. When you only provide text_target, you will retrieve its encoding.
    3. When you provide both text and text_target, you will retrieve the encoding of text together with the token ids of text_target under the labels key (see the sketch below).
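
    A minimal sketch of all three cases with t5-small (the target sentence tgt is made up for illustration):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("t5-small")

    src = "Weiter Verhandlung in Syrien."
    tgt = "Talks continue."  # hypothetical summary, just for illustration

    only_text = tokenizer(src)                # case 1: input_ids + attention_mask
    only_target = tokenizer(text_target=tgt)  # case 2: same keys, encodes tgt
    both = tokenizer(src, text_target=tgt)    # case 3: case 1 plus a labels key

    print(both["labels"] == only_target["input_ids"])  # True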

    To be honest, I think the implementation is a bit unintuitive. I would expect that passing only text_target would return an object containing just the labels key. I assume they wanted to keep their output objects and the corresponding documentation simple and therefore went with this implementation, or there is a model for which this design actually makes sense that I am unaware of.