python-3.x pandas loops huggingface-transformers huggingface-tokenizers

Apply transformer model to each row in a pandas column

I want to apply a translation model to every row in one column of a data frame.

The translation code that I am using:

from transformers import FSMTForConditionalGeneration, FSMTTokenizer
mname = "allenai/wmt19-de-en-6-6-big"
tokenizer = FSMTTokenizer.from_pretrained(mname)
model = FSMTForConditionalGeneration.from_pretrained(mname)
# Loop here over all rows in the German_Text column
text = "Wie geht es dir heute"  # placeholder for one row of German_Text

input_ids = tokenizer.encode(text, return_tensors="pt")
outputs = model.generate(input_ids)
decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(decoded)

I want to apply this model to the following column and create a new column holding the translations:

German_Text                     English_Text
Wie geht es dir heute
mir geht es gut

The English_Text column should contain the model's translation of the corresponding row in the German_Text column, so I need to run the model on each row of German_Text.

Solution

Wrap the steps in a function and pass it to the apply method of the German_Text column:

import pandas as pd
from transformers import FSMTForConditionalGeneration, FSMTTokenizer

mname = "allenai/wmt19-de-en-6-6-big"
tokenizer = FSMTTokenizer.from_pretrained(mname)
model = FSMTForConditionalGeneration.from_pretrained(mname)

df = pd.DataFrame(['Wie geht es dir heute', 'mir geht es gut'], columns=['German_Text'])

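def translation_pipeline(text):
    # Tokenize one German sentence into a batch of size 1
    input_ids = tokenizer.encode(text, return_tensors="pt")
    # Generate the English token ids
    outputs = model.generate(input_ids)
    # Convert the ids back to text, dropping special tokens
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Series.apply calls the function once per row
df['English_Text'] = df['German_Text'].apply(translation_pipeline)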
print(df)

Output:

             German_Text             English_Text
0  Wie geht es dir heute  How are you doing today
1        mir geht es gut                 I'm fine
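
For a larger column, calling the model once per row can be slow, so it may be worth translating several rows per call. The following is only a sketch under some assumptions: translate_in_batches and make_fsmt_translator are hypothetical helper names, not part of pandas or transformers, and the batch size of 16 is an arbitrary starting point. It relies on the tokenizer being callable with padding=True and on tokenizer.batch_decode.

```python
from typing import Callable, List, Sequence


def translate_in_batches(
    texts: Sequence[str],
    translate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 16,
) -> List[str]:
    """Translate texts chunk by chunk, preserving the input order."""
    results: List[str] = []
    for start in range(0, len(texts), batch_size):
        chunk = list(texts[start:start + batch_size])
        results.extend(translate_batch(chunk))
    return results


def make_fsmt_translator(mname: str = "allenai/wmt19-de-en-6-6-big"):
    """Hypothetical helper: load the model and return a batch translator.

    The imports live inside the function so the model is only loaded
    when a translator is actually requested.
    """
    import torch
    from transformers import FSMTForConditionalGeneration, FSMTTokenizer

    tokenizer = FSMTTokenizer.from_pretrained(mname)
    model = FSMTForConditionalGeneration.from_pretrained(mname)
    model.eval()

    def translate_batch(chunk: List[str]) -> List[str]:
        # padding=True pads the chunk to a common length and also
        # returns the attention mask that generate() needs
        batch = tokenizer(chunk, return_tensors="pt", padding=True)
        with torch.no_grad():
            outputs = model.generate(**batch)
        return tokenizer.batch_decode(outputs, skip_special_tokens=True)

    return translate_batch
```

With the data frame from above, this could be used as:

translate_batch = make_fsmt_translator()
df['English_Text'] = translate_in_batches(df['German_Text'].tolist(), translate_batch)

Batched output may differ slightly from the row-by-row output because of padding, so it is worth comparing a few rows before relying on it.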