Tags: python, machine-learning, nlp, pytorch, huggingface-transformers

Maximum Input Length of words/sentences of the Pegasus Model in the Transformers library


In the Transformers library, what is the maximum input length, in words and/or sentences, for the Pegasus model? I read in the Pegasus research paper that the max was 512 tokens, but how many words and/or sentences is that? Also, can you increase the maximum of 512 tokens?


Solution

  • In the Transformers library, what is the maximum input length, in words and/or sentences, for the Pegasus model? It actually depends on your pretraining. You can create a Pegasus model that supports a length of 100 tokens or 10,000 tokens. For example, the model google/pegasus-cnn_dailymail supports 1024 tokens, while google/pegasus-xsum supports 512:

    from transformers import PegasusTokenizerFast
    
    t = PegasusTokenizerFast.from_pretrained("google/pegasus-xsum")
    t2 = PegasusTokenizerFast.from_pretrained("google/pegasus-cnn_dailymail")
    print(t.max_len_single_sentence)
    print(t2.max_len_single_sentence)
    

    Output:

    511
    1023
    

    The numbers are reduced by one because of the special token that is added to each sequence.
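
    To see where the off-by-one comes from, you can inspect the tokenizer directly. A minimal sketch (model_max_length and num_special_tokens_to_add are standard attributes/methods of Transformers tokenizers):

    from transformers import PegasusTokenizerFast

    t = PegasusTokenizerFast.from_pretrained("google/pegasus-xsum")
    print(t.model_max_length)             # full model limit
    print(t.num_special_tokens_to_add())  # special tokens appended per sequence
    print(t.max_len_single_sentence)      # model limit minus special tokens


    Output:

    512
    1
    511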

    I read in the Pegasus research paper that the max was 512 tokens, but how many words and/or sentences is that?

    That depends on your vocabulary.

    from transformers import PegasusTokenizerFast
    t = PegasusTokenizerFast.from_pretrained("google/pegasus-xsum")
    print(t.tokenize('This is a test sentence'))
    print("I know {} tokens".format(len(t)))
    

    Output:

    ['▁This', '▁is', '▁a', '▁test', '▁sentence']
    I know 96103 tokens
    

    A word can be a single token, but it can also be split into several tokens:

    print(t.tokenize('neuropsychiatric conditions'))
    

    Output:

    ['▁neuro', 'psych', 'i', 'atric', '▁conditions']
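
    To estimate how many words fit into a 512-token budget, tokenize a sample of your own data and compare the counts. A rough sketch reusing the tokenizer t from above (the sample text here is made up, and the ratio varies by domain and vocabulary):

    # Compare word count to token count on a representative sample
    sample = (
        "The quick brown fox jumps over the lazy dog. "
        "Neuropsychiatric conditions are frequently discussed in clinical literature."
    )
    words = sample.split()
    tokens = t.tokenize(sample)
    print(len(words), "words ->", len(tokens), "tokens")
    # Extrapolate the word/token ratio to the 512-token limit
    print("roughly {:.0f} words per 512 tokens".format(512 * len(words) / len(tokens)))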
    

    Also, can you increase the maximum of 512 tokens?

    Yes, you can train a model with the Pegasus architecture for a different input length, but this is costly.
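
    For illustration only, instantiating a Pegasus model with a larger limit looks roughly like this (a sketch, not a recipe: max_position_embeddings is the relevant field of PegasusConfig, and the resulting model starts from random weights, so it would need to be pretrained before it is useful):

    from transformers import PegasusConfig, PegasusForConditionalGeneration

    config = PegasusConfig(max_position_embeddings=2048)  # hypothetical 2048-token limit
    model = PegasusForConditionalGeneration(config)       # randomly initialized, untrained

    If you only need to feed longer documents to an existing checkpoint, the cheaper workaround is to truncate at encoding time, e.g. t(text, truncation=True, max_length=512), at the cost of discarding everything past the limit.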