import time
start = time.time()

import pandas as pd
from deep_translator import GoogleTranslator

data = pd.read_excel(r"latestdata.xlsx")
translatedata = data['column'].fillna('novalue')

translations = []                                # renamed: do not shadow the builtin list
for i in translatedata:
    finaldata = GoogleTranslator(source='auto', target='english').translate(i)
    print(finaldata)
    translations.append(finaldata)

df = pd.DataFrame(translations, columns=['Translated_values'])
df.to_csv(r"jobdone.csv", sep=';')

end = time.time()
print(f"Runtime of the program is {end - start}")
I have a dataset of 220k rows and am trying to translate the data of one column. At first I tried to parallelize the work with a multiprocessing pool, but I got an error saying that I cannot access the API several times at once. My question is whether there is any other way to improve the performance of the code I have right now.
# 4066.826668739319 [s] with just 10000 rows, all in one go
# 3809.4675991535187 [s] computation time when run in 2 batches of 5000
Q :
" ... is there any other way to improve the performance of the code ... ? "
A :
Yes, there are a few ways, yet do not expect anything magical, as you have already reported that the API-provider throttles/blocks somewhat higher levels of concurrent API-calls from being served.

There still might be some positive effect from latency-masking tricks of a just-[CONCURRENT] orchestration of several API-calls, as the end-to-end latencies are principally "long" ( going many times across the over-the-"network"-horizons ), with some remarkable server-side TAT-latency added on the translation-matching engines.
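For illustration only, a minimal latency-masking sketch with a small thread-pool, assuming the provider tolerates a few in-flight requests ( MAX_WORKERS, translate_one and texts are illustrative names, not part of the original code ) :

from concurrent.futures import ThreadPoolExecutor
from deep_translator import GoogleTranslator

MAX_WORKERS = 2                                    # keep this low, the provider throttles

def translate_one( text ):
    # each call rides its own thread, masking network + server-side TAT-latency
    return GoogleTranslator( source = 'auto',
                             target = 'english'
                             ).translate( text )

with ThreadPoolExecutor( max_workers = MAX_WORKERS ) as pool:
    results = list( pool.map( translate_one, texts ) )   # texts : an iterable of strings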
Details matter, a lot...
A performance-boosting code-template to start with
( avoiding the add-on costs of 220k+ repeated local-side overheads ) :
import time
import pandas as pd
from deep_translator import GoogleTranslator as gXLTe

xltDF = pd.read_excel( r"latestdata.xlsx" )['column'].fillna( 'novalue' )
resDF = xltDF.copy( deep = True )

aXLT  = gXLTe( source = 'auto',                    # instantiated ONCE, not 220k+ times
               target = 'english'
               )

PROC_ns_START = time.perf_counter_ns()
#________________________________________________________ CRITICAL SECTION: start
for i in range( len( xltDF ) ):
    resDF.iloc[i] = aXLT.translate( xltDF.iloc[i] )
#________________________________________________________ CRITICAL SECTION: end
PROC_ns_END = time.perf_counter_ns()

resDF.to_csv( r"jobdone.csv",
              sep = ';'
              )

print( "Runtime was {0:} [ns]".format( PROC_ns_END - PROC_ns_START ) )
Tips for performance boosting :

[CONCURRENT] accesses to the same DataFrame-object work best using non-overlapping, separate ( thread-private ) block-iterators over disjunct halves ( for a pair of threads ), over disjunct thirds ( for 3 threads ) etc. ( a minimal thread-launcher sketch follows the worker code below ) :

def thread_hosted_blockCRAWLer( i_start, i_end ):
    aXLT = gXLTe( source = 'auto',                 # a thread-private translator instance
                  target = 'english'
                  )
    for i in range( i_start, i_end ):
        while True:
            try:
                resDF.iloc[i] = aXLT.translate( xltDF.iloc[i] )
                # SUCCEEDED
                break
            except Exception:
                # FAILED
                print( "EXC: _blockCRAWLer() on index ", i )
                time.sleep( ... )                  # fill in a suitable back-off delay here
                # be careful here, not to get on API-provider's BLACK-LIST
                continue
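A minimal launcher sketch for the disjunct-blocks idea above ( the two-thread split and all names except the worker are illustrative; concurrent writes are tolerable here only because each thread touches its own disjunct index-range ) :

import threading

N_THREADS = 2                                      # a pair of threads over disjunct halves
block     = len( xltDF ) // N_THREADS
bounds    = [ ( t * block,
                ( t + 1 ) * block if t < N_THREADS - 1 else len( xltDF )
                )
              for t in range( N_THREADS )
              ]

workers = [ threading.Thread( target = thread_hosted_blockCRAWLer,
                              args   = ( i_start, i_end )
                              )
            for ( i_start, i_end ) in bounds
            ]
for w in workers: w.start()                        # launch all block-crawlers
for w in workers: w.join()                         # wait until every block has finished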
Do not hesitate to go tuning & tweaking - and anyway, keep us posted on how fast you managed to get. That's fair, isn't it?