
What is the function of the `text_target` parameter in Huggingface's `AutoTokenizer`?


I'm following the guide here: https://huggingface.co/docs/transformers/v4.28.1/tasks/summarization

There is one line in the guide like this:

labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True)

I don't understand the function of the text_target parameter.

I tried the following code and the last two lines gave exactly the same results.

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('t5-small')
text = "Weiter Verhandlung in Syrien."
tokenizer(text_target=text, max_length=128, truncation=True)
tokenizer(text, max_length=128, truncation=True)

The docs just say: "text_target (str, List[str], List[List[str]], optional): The sequence or batch of sequences to be encoded as target texts." I don't really understand that. Are there situations where setting text_target gives a different result?


Solution

  • Sometimes it is necessary to look at the source code (the snippet below is from PreTrainedTokenizerBase.__call__):

    if text is None and text_target is None:
        raise ValueError("You need to specify either `text` or `text_target`.")
    if text is not None:
        # The context manager will send the inputs as normal texts and not text_target, but we shouldn't change the
        # input mode in this case.
        if not self._in_target_context_manager:
            self._switch_to_input_mode()
        encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
    if text_target is not None:
        self._switch_to_target_mode()
        target_encodings = self._call_one(text=text_target, text_pair=text_pair_target, **all_kwargs)
    # Leave back tokenizer in input mode
    self._switch_to_input_mode()
    
    if text_target is None:
        return encodings
    elif text is None:
        return target_encodings
    else:
        encodings["labels"] = target_encodings["input_ids"]
        return encodings
    

    As you can see in the above snippet, both text and text_target are passed to self._call_one() for encoding (note that text_target is passed as the text parameter). That means encoding the same string as text or as text_target will give identical results as long as _switch_to_target_mode() doesn't do anything special. For t5-small it doesn't: the base class implements both mode switches as no-ops and the T5 tokenizer doesn't override them, which is why your two calls returned the same output.
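
    There are tokenizers where target mode does matter, though. For example, mBART's tokenizer overrides _switch_to_target_mode() to emit the target-language code instead of the source-language one. A minimal sketch (assuming the facebook/mbart-large-en-ro checkpoint is available):

    from transformers import AutoTokenizer

    # mBART appends a language code to every sequence, so input mode and
    # target mode produce different special tokens.
    tokenizer = AutoTokenizer.from_pretrained(
        "facebook/mbart-large-en-ro", src_lang="en_XX", tgt_lang="ro_RO"
    )

    text = "Further negotiations in Syria."
    as_input = tokenizer(text)["input_ids"]               # ... </s> en_XX
    as_target = tokenizer(text_target=text)["input_ids"]  # ... </s> ro_RO

    print(as_input == as_target)  # False: the trailing language code differs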

    The conditions at the end of the function answer your question:

    1. When you only provide text, you will retrieve its encoding.
    2. When you only provide text_target, you will retrieve its encoding.
    3. When you provide both text and text_target, you will retrieve the encoding of text together with the token ids of text_target under the labels key (see the sketch below).
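
    A minimal sketch of all three cases with t5-small (the target sentence tgt is made up for illustration):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("t5-small")

    src = "Weiter Verhandlung in Syrien."
    tgt = "Talks continue."  # hypothetical summary, just for illustration

    only_text = tokenizer(src)                # case 1: input_ids + attention_mask
    only_target = tokenizer(text_target=tgt)  # case 2: same keys, encodes tgt
    both = tokenizer(src, text_target=tgt)    # case 3: case 1 plus a labels key

    print(both["labels"] == only_target["input_ids"])  # True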

    To be honest, I think the implementation is a bit unintuitive. I would expect that passing only text_target would return an object containing just the labels key. I assume they wanted to keep their output objects and the corresponding documentation simple and therefore went with this implementation, or there is a model for which this design actually makes sense that I am unaware of.