I need to translate large amounts of text from a database, so I have been working with transformers and models for a few days. I am by no means a data science expert, and unfortunately I am stuck.
The problem starts with longer texts. The second issue is the usual maximum input length (512 tokens) of these models. Simply truncating is not really an option. I did find a workaround, but it does not work properly, and the result is word salad on longer texts (more than about 300 tokens).
Here is an example (please ignore the warnings; that is a separate issue that does not hurt much at the moment):
If I repeat the example sentence 2 times (55 tokens) or 5 times (163 tokens), there are no issues.
But the output gets garbled with, for example, 433 tokens (the third green text block in the screenshot).
With more than 510 tokens, I tried to split the input into chunks as described in the article linked in the code below. But the result is also pretty strange.
I am pretty sure that I have made more than one mistake and underestimated this topic. But I see no alternative (free or cheap) way to translate large amounts of text.
Can you help me out? Which errors (in the code or in my thinking) do you see, and how would you suggest solving them? Thank you very much.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch
if torch.cuda.is_available():
    dev = "cuda"
else:
    dev = "cpu"
device = torch.device(dev)
mname = 'Helsinki-NLP/opus-mt-de-en'
tokenizer = AutoTokenizer.from_pretrained(mname)
model = AutoModelForSeq2SeqLM.from_pretrained(mname)
model.to(device)
chunksize = 512
text_short = "Nach nur sieben Seiten appellierte man an die Wählerinnen und Wähler, sich richtig zu entscheiden, nämlich für Frieden, Freiheit, Sozialismus. "
text_long = text_short
#this loop is just for debugging/testing and simulating long text
for x in range(30):
    text_long = text_long + text_short
tokens = tokenizer.encode_plus(text_long, return_tensors="pt", add_special_tokens=True, padding=False, truncation=False).to(device)
str_len = len(tokens['input_ids'][0])
if str_len > 510:
    # split into chunks of 510 tokens, we also convert to list (default is tuple which is immutable)
    input_id_chunks = list(tokens['input_ids'][0].split(chunksize - 2))
    mask_chunks = list(tokens['attention_mask'][0].split(chunksize - 2))
    cnt = 1
    for tensor in input_id_chunks:
        print('\033[96m' + 'chunk ' + str(cnt) + ': ' + str(len(tensor)) + '\033[93m')
        cnt += 1
    # loop through each chunk
    # https://towardsdatascience.com/how-to-apply-transformers-to-any-length-of-text-a5601410af7f
    for i in range(len(input_id_chunks)):
        # add CLS and SEP tokens to input IDs
        input_id_chunks[i] = torch.cat([
            torch.tensor([101]).to(device), input_id_chunks[i], torch.tensor([102]).to(device)
        ])
        # add attention tokens to attention mask
        mask_chunks[i] = torch.cat([
            torch.tensor([1]).to(device), mask_chunks[i], torch.tensor([1]).to(device)
        ])
        # get required padding length
        pad_len = chunksize - input_id_chunks[i].shape[0]
        # check if tensor length satisfies required chunk size
        if pad_len > 0:
            # if padding length is more than 0, we must add padding
            input_id_chunks[i] = torch.cat([
                input_id_chunks[i], torch.Tensor([0] * pad_len).to(device)
            ])
            mask_chunks[i] = torch.cat([
                mask_chunks[i], torch.Tensor([0] * pad_len).to(device)
            ])
    input_ids = torch.stack(input_id_chunks)
    attention_mask = torch.stack(mask_chunks)
    input_dict = {'input_ids': input_ids.long(), 'attention_mask': attention_mask.int()}
    outputs = model.generate(**input_dict)
    #this doesnt work - following error comes to the console --> "host_softmax" not implemented for 'Long'
    #probs = torch.nn.functional.softmax(outputs[0], dim=-1)
    # probs
    # probs = probs.mean(dim=0)
    # probs
else:
    tokens["input_ids"] = tokens["input_ids"][:, :512] #truncating normally not necessary
    tokens["attention_mask"] = tokens["attention_mask"][:, :512]
    outputs = model.generate(**tokens)
decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
print('\033[94m' + str(str_len))
print('\033[92m' + decoded)
Remark: the following libraries are required:
pip3 install torch==1.9.0+cu102 torchvision==0.10.0+cu102 torchaudio===0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
pip install transformers
pip install sentencepiece
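One likely cause of the word salad in the chunked branch of the code above: the IDs 101 and 102 are the [CLS] and [SEP] tokens of the BERT vocabulary used in the linked article. As far as I know, Marian tokenizers add no start token and only append an end-of-sequence token, and they have their own padding ID, so the hardcoded 101, 102 and 0 mean something different to this model. The following is a minimal sketch of the same chunking in plain Python, assuming the real IDs are passed in; `chunk_with_eos` and its parameter names are made up for this illustration.

```python
from typing import List, Tuple


def chunk_with_eos(
    token_ids: List[int], eos_id: int, pad_id: int, chunk_size: int = 512
) -> Tuple[List[List[int]], List[List[int]]]:
    """Split token IDs into fixed-size chunks that each end with eos_id.

    Assumption: token_ids was produced without special tokens,
    so it does not already end with the EOS ID.
    """
    body = chunk_size - 1  # leave room for one EOS token per chunk
    input_ids: List[List[int]] = []
    attention_mask: List[List[int]] = []
    for start in range(0, len(token_ids), body):
        piece = token_ids[start:start + body] + [eos_id]
        pad_len = chunk_size - len(piece)
        # pad with the model's own padding ID and mask the padding out
        input_ids.append(piece + [pad_id] * pad_len)
        attention_mask.append([1] * len(piece) + [0] * pad_len)
    return input_ids, attention_mask
```

With the real tokenizer you would presumably pass `tokenizer(text_long, add_special_tokens=False)["input_ids"]`, `tokenizer.eos_token_id` and `tokenizer.pad_token_id`, wrap both results in `torch.tensor(...)`, and decode every row of the output rather than only `outputs[0]`. Even then, chunks cut in the middle of a sentence tend to translate badly, which is why splitting on sentence boundaries is the better route.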
To translate long texts with transformers, you can split the text into paragraphs, split each paragraph into sentences, and then feed the sentences to your model in batches. In any case, it is better to translate sentence by sentence with MarianMT, because it can drop parts of the text if you feed it a long text as one piece.
from transformers import MarianMTModel, MarianTokenizer
from nltk.tokenize import sent_tokenize
from nltk.tokenize import LineTokenizer
import math
import torch
if torch.cuda.is_available():
    dev = "cuda"
else:
    dev = "cpu"
device = torch.device(dev)
mname = 'Helsinki-NLP/opus-mt-de-en'
tokenizer = MarianTokenizer.from_pretrained(mname)
model = MarianMTModel.from_pretrained(mname)
model.to(device)
lt = LineTokenizer()
batch_size = 8
text_short = "Nach nur sieben Seiten appellierte man an die Wählerinnen und Wähler, sich richtig zu entscheiden, nämlich für Frieden, Freiheit, Sozialismus. "
text_long = text_short * 30
paragraphs = lt.tokenize(text_long)
translated_paragraphs = []
for paragraph in paragraphs:
    sentences = sent_tokenize(paragraph)
    batches = math.ceil(len(sentences) / batch_size)
    translated = []
    for i in range(batches):
        sent_batch = sentences[i*batch_size:(i+1)*batch_size]
        model_inputs = tokenizer(sent_batch, return_tensors="pt", padding=True, truncation=True, max_length=500).to(device)
        with torch.no_grad():
            translated_batch = model.generate(**model_inputs)
        translated += translated_batch
    translated = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
    translated_paragraphs += [" ".join(translated)]
translated_text = "\n".join(translated_paragraphs)
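If you want to reuse this for many database rows, the paragraph, sentence and batch steps can be separated from the model. Below is a rough sketch of that idea, not a drop-in replacement: `split_sentences` is a naive regex stand-in for `sent_tokenize`, and `translate_long_text` and `translate_batch` are hypothetical names introduced here.

```python
import re
from typing import Callable, List


def split_sentences(paragraph: str) -> List[str]:
    # Naive stand-in for nltk's sent_tokenize:
    # split after '.', '!' or '?' when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


def translate_long_text(
    text: str,
    translate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 8,
) -> str:
    """Translate text paragraph by paragraph, sentence batch by sentence batch."""
    translated_paragraphs = []
    for paragraph in text.splitlines():
        if not paragraph.strip():
            continue  # skip blank lines between paragraphs
        sentences = split_sentences(paragraph)
        translated: List[str] = []
        for i in range(0, len(sentences), batch_size):
            # translate_batch gets a list of sentences and returns a list of strings
            translated.extend(translate_batch(sentences[i:i + batch_size]))
        translated_paragraphs.append(" ".join(translated))
    return "\n".join(translated_paragraphs)
```

Here `translate_batch` would wrap the `tokenizer(...)`, `model.generate(...)` and `tokenizer.decode(...)` calls from the code above. Keeping it as a parameter also lets you test the splitting logic with a dummy function before loading the model.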