Tags: python, translation, huggingface-transformers, huggingface-tokenizers

Strange results with Hugging Face transformers (MarianMT) when translating longer text


I need to translate large amounts of text from a database, so I have been working with transformers and translation models for a few days. I am not a data science expert and unfortunately I am not getting any further.

The problem starts with longer texts. The second issue is the usual maximum token length (512) of these models. Simply truncating is not really an option. I did find a work-around (the approach linked in the code comment below), but it does not work properly and the result is word salad on longer texts (>300 tokens).
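
For reference, here is a minimal sketch of how I check the model's limit and the token count of a text up front; it uses the same Helsinki-NLP/opus-mt-de-en tokenizer as the full code below, and the example string is just a placeholder:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('Helsinki-NLP/opus-mt-de-en')
print(tokenizer.model_max_length)  # maximum input length of the model (512 for this one)

# token count of a text, including the special tokens the tokenizer adds
ids = tokenizer.encode("Ein Beispieltext.", add_special_tokens=True)
print(len(ids))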

Here is an example (please ignore the warnings; that is a separate issue which does not hurt much at the moment):

If I take the example sentence 2 times (55 tokens) or 5 times (163 tokens), there are no issues.

But it gets messed up with, e.g., 433 tokens (the third green text block in the screenshot).

[screenshot of the translation output]

With more than 510 tokens, I tried to split the text into chunks as described in the linked article. But the result here is also pretty strange.

I am pretty sure that I have made more than one mistake and underestimated this topic. But I see no alternative (free/cheap) way to translate large amounts of text.

Can you help me out? Which errors (in my reasoning) do you see, and how would you suggest solving these issues? Thank you very much.


from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

if torch.cuda.is_available():  
  dev = "cuda"
else:  
  dev = "cpu" 
device = torch.device(dev)
 
mname = 'Helsinki-NLP/opus-mt-de-en'
tokenizer = AutoTokenizer.from_pretrained(mname)
model = AutoModelForSeq2SeqLM.from_pretrained(mname)
model.to(device)

chunksize = 512

text_short = "Nach nur sieben Seiten appellierte man an die Wählerinnen und Wähler, sich richtig zu entscheiden, nämlich für Frieden, Freiheit, Sozialismus. "
text_long = text_short
#this loop is just for debugging/testing and simulating long text
for x in range(30):
    text_long = text_long + text_short

tokens = tokenizer.encode_plus(text_long, return_tensors="pt", add_special_tokens=True, padding=False, truncation=False).to(device)
str_len = len(tokens['input_ids'][0])

if str_len > 510:
    # split into chunks of 510 tokens, we also convert to list (default is tuple which is immutable)
    input_id_chunks = list(tokens['input_ids'][0].split(chunksize - 2))
    mask_chunks = list(tokens['attention_mask'][0].split(chunksize - 2))

    cnt = 1
    for tensor in input_id_chunks:
        print('\033[96m' + 'chunk ' + str(cnt) + ': ' + str(len(tensor)) + '\033[93m')
        cnt += 1
    
    # loop through each chunk
    # https://towardsdatascience.com/how-to-apply-transformers-to-any-length-of-text-a5601410af7f
    for i in range(len(input_id_chunks)):
        # add CLS and SEP tokens to input IDs
        input_id_chunks[i] = torch.cat([
            torch.tensor([101]).to(device), input_id_chunks[i], torch.tensor([102]).to(device)
        ])
        # add attention tokens to attention mask
        mask_chunks[i] = torch.cat([
            torch.tensor([1]).to(device), mask_chunks[i], torch.tensor([1]).to(device)
        ])
        # get required padding length
        pad_len = chunksize - input_id_chunks[i].shape[0]
        # check if tensor length satisfies required chunk size
        if pad_len > 0:
            # if padding length is more than 0, we must add padding
            input_id_chunks[i] = torch.cat([
                input_id_chunks[i], torch.Tensor([0] * pad_len).to(device)
            ])
            mask_chunks[i] = torch.cat([
                mask_chunks[i], torch.Tensor([0] * pad_len).to(device)
            ])
   
    input_ids = torch.stack(input_id_chunks)
    attention_mask = torch.stack(mask_chunks)
    input_dict = {'input_ids': input_ids.long(), 'attention_mask': attention_mask.int()}
    
    outputs = model.generate(**input_dict)
    # this doesn't work - the following error appears in the console --> "host_softmax" not implemented for 'Long'
    #probs = torch.nn.functional.softmax(outputs[0], dim=-1)
    # probs
    # probs = probs.mean(dim=0)
    # probs
  
else:
    tokens["input_ids"] = tokens["input_ids"][:, :512] #truncating normally not necessary
    tokens["attention_mask"] = tokens["attention_mask"][:, :512]
    outputs = model.generate(**tokens)

decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
print('\033[94m' + str(str_len))
print('\033[92m' + decoded)

Remark: the following libraries are necessary:

pip3 install torch==1.9.0+cu102 torchvision==0.10.0+cu102 torchaudio===0.9.0 -f https://download.pytorch.org/whl/torch_stable.html

pip install transformers

pip install sentencepiece


Solution

  • To translate long texts with transformers, you can split the text into paragraphs, split the paragraphs into sentences, and then feed the sentences to the model in batches. In any case it is better to translate with MarianMT sentence by sentence, because it can lose parts of the text if you feed it a long text as one piece.

    from transformers import MarianMTModel, MarianTokenizer
    from nltk.tokenize import sent_tokenize
    from nltk.tokenize import LineTokenizer
    import math
    import torch
    
    if torch.cuda.is_available():  
      dev = "cuda"
    else:  
      dev = "cpu" 
    device = torch.device(dev)
     
    mname = 'Helsinki-NLP/opus-mt-de-en'
    tokenizer = MarianTokenizer.from_pretrained(mname)
    model = MarianMTModel.from_pretrained(mname)
    model.to(device)
    
    lt = LineTokenizer()
    batch_size = 8
    
    text_short = "Nach nur sieben Seiten appellierte man an die Wählerinnen und Wähler, sich richtig zu entscheiden, nämlich für Frieden, Freiheit, Sozialismus. "
    text_long = text_short * 30
    
    paragraphs = lt.tokenize(text_long)   
    translated_paragraphs = []
    
    for paragraph in paragraphs:
        # split each paragraph into sentences and translate them batch by batch
        sentences = sent_tokenize(paragraph)
        batches = math.ceil(len(sentences) / batch_size)
        translated = []
        for i in range(batches):
            sent_batch = sentences[i*batch_size:(i+1)*batch_size]
            model_inputs = tokenizer(sent_batch, return_tensors="pt", padding=True, truncation=True, max_length=500).to(device)
            with torch.no_grad():
                translated_batch = model.generate(**model_inputs)
            translated += translated_batch
        # decode the generated token IDs back into text and rejoin the sentences
        translated = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
        translated_paragraphs += [" ".join(translated)]
    
    translated_text = "\n".join(translated_paragraphs)
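
  • One practical note on the snippet above: sent_tokenize needs NLTK's punkt sentence-tokenizer data, which has to be downloaded once, and it accepts a language argument for non-English input (e.g. language='german'). A minimal sketch:

    import nltk
    nltk.download('punkt')

    # inside the loop above you could then also tokenize explicitly for German:
    # sentences = sent_tokenize(paragraph, language='german')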