deep-learning nlp huggingface-transformers machine-translation

mbart50 having trouble translating long texts/documents?

I'm new to NLP and mBART, so apologies if this is a basic question. I'm having trouble translating long Korean texts into English using mBART-50.

It works fine with shorter texts (for example, a single sentence). But with longer texts such as news articles, it always gives me the error "index out of range in self".

Here's my code:

from transformers import MBartForConditionalGeneration, MBart50Tokenizer
import streamlit as st
import csv


@st.cache_resource
def download_model():
    model_name = "facebook/mbart-large-50-many-to-many-mmt"
    model = MBartForConditionalGeneration.from_pretrained(model_name)
    tokenizer = MBart50Tokenizer.from_pretrained(model_name, src_lang="ko_KR")
    return model, tokenizer


model, tokenizer = download_model()

model_name = "facebook/mbart-large-50-many-to-many-mmt"
tokenizer.src_lang = "ko_KR"

with open('Korean_Translation.csv', 'w', newline='', encoding='UTF-8') as korean_translation:
    translation_writer = csv.writer(korean_translation)

    with open('original_text.txt', mode='r', encoding='UTF-8') as korean_original:
        original_lines = korean_original.readlines()
        for lines in original_lines:
            print(lines)
            encoded_korean_text = tokenizer(lines, return_tensors="pt")
            generated_tokens = model.generate(**encoded_korean_text,
                                              forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"],
                                              max_length=99999999999999)
            out2 = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
            print(out2)
            translation_writer.writerow(out2)

The error it gives me looks like this:

2023-03-10 14:15:04.182 Uncaught app exception
Traceback (most recent call last):
  File "E:\Python 3.10.5\lib\site-packages\streamlit\runtime\scriptrunner\script_runner.py", line 565, in _run_script
    exec(code, module.__dict__)
  File "D:\Study\NLP\Multilingual_news_analysis\pythonProject\test.py", line 36, in <module>
    generated_tokens = model.generate(**encoded_korean_text,
  File "E:\Python 3.10.5\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "E:\Python 3.10.5\lib\site-packages\transformers\generation\utils.py", line 1252, in generate
    model_kwargs = self._prepare_encoder_decoder_kwargs_for_generation(
  File "E:\Python 3.10.5\lib\site-packages\transformers\generation\utils.py", line 617, in _prepare_encoder_decoder_kwargs_for_generation
    model_kwargs["encoder_outputs"]: ModelOutput = encoder(**encoder_kwargs)
  File "E:\Python 3.10.5\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "E:\Python 3.10.5\lib\site-packages\transformers\models\mbart\modeling_mbart.py", line 794, in forward
    embed_pos = self.embed_positions(input)
  File "E:\Python 3.10.5\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "E:\Python 3.10.5\lib\site-packages\transformers\models\mbart\modeling_mbart.py", line 133, in forward
    return super().forward(positions + self.offset)
  File "E:\Python 3.10.5\lib\site-packages\torch\nn\modules\sparse.py", line 160, in forward
    return F.embedding(
  File "E:\Python 3.10.5\lib\site-packages\torch\nn\functional.py", line 2210, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self

Why is this happening? Is it because the text is too long (about 600 characters)? It doesn't happen with shorter texts (< 200 characters). How can I fix this? Thanks!

Solution

mBART-50 has a maximum input length of 1024 subword tokens. It uses learned position embeddings, so embeddings exist only for positions up to that limit. When the tokenized input is longer, there is no embedding for the extra positions and the lookup fails with "index out of range in self". You can see in the stack trace that this happens in the encoder, at the call to self.embed_positions.

You can split the text into pieces that are shorter but still meaningful on their own, and translate each piece separately. As a last resort, you can turn on truncation to the maximum length when tokenizing, at the cost of dropping everything past the limit.
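The splitting idea could be sketched roughly as follows. This is only an illustration, not part of the transformers library: the helper name `split_into_chunks`, the regex-based sentence split, and the `count_tokens` callback are all assumptions made for the example. The helper greedily packs sentences into chunks that stay within a token budget.

```python
import re
from typing import Callable, List


def split_into_chunks(
    text: str,
    count_tokens: Callable[[str], int],
    max_tokens: int = 1024,
) -> List[str]:
    """Greedily pack sentences into chunks of at most max_tokens tokens.

    Hypothetical helper for illustration. count_tokens is any function
    that returns the token count of a string.
    """
    # Naive sentence split after ., ! or ? followed by whitespace.
    # This is an assumption; a proper sentence segmenter may work better.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and count_tokens(candidate) > max_tokens:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

With the tokenizer from the question, the callback could be something like `lambda s: len(tokenizer(s)["input_ids"])`, and each chunk would then be passed to `model.generate` on its own. Note that a single sentence longer than the budget still ends up as one oversized chunk, so it may be worth combining this with truncation, e.g. `tokenizer(chunk, return_tensors="pt", truncation=True, max_length=1024)`.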

A similar thing can happen in the decoder. If you set max_length in generate to something larger than 1024 (as with max_length=99999999999999 in your code), the decoder can run out of position embeddings once the output grows past that limit.
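One way to guard against the decoder-side problem is to cap the requested length before calling generate. The helper below is a hypothetical sketch (the name `clamp_generation_length` is made up for this example); the default of 1024 is the limit mentioned above, which I would expect to be readable from `model.config.max_position_embeddings` rather than hard-coded.

```python
def clamp_generation_length(requested: int, max_positions: int = 1024) -> int:
    """Return a max_length value that does not exceed the position limit.

    Hypothetical helper for illustration. max_positions is the number of
    learned position embeddings the decoder has available.
    """
    if requested < 1:
        raise ValueError("requested length must be at least 1")
    return min(requested, max_positions)
```

In the code from the question, this would mean passing something like `max_length=clamp_generation_length(99999999999999)` to `model.generate`, or simply a fixed value of 1024 or lower.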