I have a name column with values in different languages. In the end, I need them in English. When I use just a single variable it works, but how can I do this for the complete dataframe, for one or more columns?
from deep_translator import GoogleTranslator
reader = "df"
NAME_ORIG = "ich suche den namen"
translated = GoogleTranslator(source='auto', target='en').translate(NAME_ORIG)
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types as T

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    ['ich suche den namen'],
    ['guten tag']],
    ['name'])

@F.udf(returnType=T.StringType())
def translate(input):
    # import inside the UDF so the dependency is resolved on the executors
    from deep_translator import GoogleTranslator
    return GoogleTranslator(source='auto', target='en').translate(input)

df.withColumn('translation', translate(F.col('name'))).show()
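Since the question mentions one or more columns, the same UDF can be applied to several columns by chaining withColumn calls. A minimal sketch, assuming a hypothetical second text column called description:

# Hypothetical dataframe with a second column that should also be translated
df2 = spark.createDataFrame([
    ['ich suche den namen', 'ein beispiel']],
    ['name', 'description'])

df2.withColumn('name_en', translate(F.col('name'))) \
   .withColumn('description_en', translate(F.col('description'))) \
   .show()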
The udf approach creates a new GoogleTranslator object for each row. The documentation of deep-translator says:

"You can also reuse the Translator class and change/update its properties. (Notice that this is important for performance too, since instantiating new objects is expensive.)"
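Outside of Spark, that reuse pattern looks like this (a minimal sketch):

from deep_translator import GoogleTranslator

dt = GoogleTranslator(source='auto', target='en')   # created once ...
print(dt.translate('ich suche den namen'))          # ... and reused for every call
print(dt.translate('guten tag'))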
Within Spark, reusing the translator object can be achieved with mapPartitions, so that only one translator object per partition is created:
def translatePartition(rows):
    from deep_translator import GoogleTranslator
    # one translator object per partition, reused for all rows in that partition
    dt = GoogleTranslator(source='auto', target='en')
    for row in rows:
        yield (row['name'], dt.translate(row['name']))

df.rdd.mapPartitions(translatePartition).toDF(["name", "translation"]).show()
The deep-translator API also offers a translate_batch function. It can be used by preparing the batches inside of mapPartitions:
BATCH_SIZE = 5

def translatePartitionWithBatch(rows):
    from deep_translator import GoogleTranslator
    dt = GoogleTranslator(source='auto', target='en')

    def translateBatch(batch):
        # translate the whole batch with one call and pair each text with its translation
        translations = dt.translate_batch(batch)
        for text, translation in zip(batch, translations):
            yield (text, translation)

    batch = []
    for row in rows:
        batch.append(row['name'])
        if len(batch) >= BATCH_SIZE:
            yield from translateBatch(batch)
            batch = []
    if len(batch) > 0:
        # translate the remaining texts that did not fill a complete batch
        yield from translateBatch(batch)

df.rdd.mapPartitions(translatePartitionWithBatch).toDF(["name", "translation"]).show()
Using GoogleTranslator.translate_batch instead of GoogleTranslator.translate may or may not improve performance further.
All three approaches produce the same output:
+-------------------+------------------------+
|name |translation |
+-------------------+------------------------+
|ich suche den namen|I'm looking for the name|
|guten tag |Good day |
+-------------------+------------------------+
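Note that the mapPartitions variants above only return the two yielded fields. If the dataframe has additional columns that should be kept, the yielded tuple can include the full row, a sketch based on the same translatePartition idea:

def translatePartitionKeepAll(rows):
    from deep_translator import GoogleTranslator
    dt = GoogleTranslator(source='auto', target='en')
    for row in rows:
        # keep every original field and append the translation of 'name'
        yield (*row, dt.translate(row['name']))

df.rdd.mapPartitions(translatePartitionKeepAll).toDF(df.columns + ['translation']).show()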