python  pandas  google-translate  concurrent-processing

How to improve code performance ( using Google Translate API )


import time
start = time.time()
import pandas as pd
from deep_translator import GoogleTranslator

data = pd.read_excel(r"latestdata.xlsx")
translatedata = data['column'].fillna('novalue')

# One API call per row; "translated" replaces the name "list", which shadowed the built-in
translated = []
for i in translatedata:
    finaldata = GoogleTranslator(source='auto', target='english').translate(i)
    print(finaldata)
    translated.append(finaldata)

df = pd.DataFrame(translated, columns=['Translated_values'])
df.to_csv(r"jobdone.csv", sep=';')

end = time.time()

print(f"Runtime of the program is {end - start}")

I have 220k data points and am trying to translate the data in one column. At first I tried a parallel approach using the pool method, but got an error saying that I cannot access the API several times at once. My question is whether there is another way to improve the performance of the code I have right now.

# 4066.826668739319     seconds with just 10000 rows, all in one run
# 3809.4675991535187    seconds when run in 2 batches of 5000

Solution

  • Q :
    " ... is ( there ) other way to improve performance of code ...? "

    A :
    Yes, there are a few ways,
    yet do not expect anything magical, as you have already reported that the API provider throttles or blocks higher levels of concurrent API calls.

    There still might be some positive effect from latency masking, that is, from a just-[CONCURRENT] orchestration of several API calls: the end-to-end latencies are principally "long", as every call crosses the network several times and also waits through a considerable server-side turnaround time (TAT) in the translation engine.

    Details matter, a lot...
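To see what latency masking buys in principle, here is a minimal sketch that never touches the real API. `fake_api_call` is a hypothetical stand-in that only sleeps for an assumed 50 ms, and the worker count of 4 is an arbitrary choice; what the real service tolerates is for you to find out.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_api_call(text, latency_s=0.05):
    # Hypothetical stand-in for a remote translation call:
    # it only waits (simulated network + server latency), then echoes the input.
    time.sleep(latency_s)
    return text.upper()

def run_sequential(items):
    # One call after another: total time is roughly len(items) * latency
    return [fake_api_call(t) for t in items]

def run_concurrent(items, max_workers=4):
    # Several calls wait at the same time; map() returns results in input order
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fake_api_call, items))

def timed(fn, *args, **kwargs):
    # Return the function result together with the elapsed wall-clock seconds
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - t0

items = [f"text {i}" for i in range(8)]
seq_result, seq_s = timed(run_sequential, items)
con_result, con_s = timed(run_concurrent, items)
print(f"sequential: {seq_s:.2f} s, concurrent: {con_s:.2f} s")
```

The threads spend nearly all their time waiting, so the GIL is not the limiting factor here. With a real provider the gain depends on how many simultaneous calls it accepts before throttling.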

    A performance boosting code-template to start with
    ( avoiding 220k+ repeated local-side overheads' add-on costs ) :

    import time
    import pandas as pd
    from   deep_translator import GoogleTranslator as gXLTe

    xltDF = pd.read_excel( r"latestdata.xlsx" )['column'].fillna( 'novalue' )
    resDF = xltDF.copy( deep = True )

    PROC_ns_START = time.perf_counter_ns()
    #________________________________________________________ CRITICAL SECTION: start
    for                  i in range( len( xltDF ) ):
             # .iloc is indexed with [ ], it cannot be called or assigned to with ( )
             resDF.iloc[i] = gXLTe( source = 'auto',
                                    target = 'english'
                                    ).translate( xltDF.iloc[i] )

    #________________________________________________________ CRITICAL SECTION: end
    PROC_ns_END = time.perf_counter_ns()

    resDF.to_csv( r"jobdone.csv",
                  sep = ';'
                  )

    # plain f-string: mixing f"..." with .format() would have printed a literal 0
    print( f"Runtime was {PROC_ns_END - PROC_ns_START} [ns]" )
    

    Tips for performance boosting :

    • if Google API policy permits, we may increase the number of threads that take part in the CRITICAL SECTION,
    • as the Python interpreter's threads live "inside" the same address space and are still serialised by the GIL lock, all just-[CONCURRENT] accesses may operate on the same DataFrame objects, best using non-overlapping, separate (thread-private) block iterators over disjoint halves ( for a pair of threads ), over disjoint thirds ( for 3 threads ), etc.,
    • as the Google API policy limits overly concurrent access to the API service, you should build in some robustness, even if naive:
    def thread_hosted_blockCRAWLer( i_start, i_end ):
        for i in range( i_start, i_end ):
            while True:
                  try:
                      # .iloc is indexed with [ ], not called with ( )
                      resDF.iloc[i] = gXLTe( source = 'auto',
                                             target = 'english'
                                             ).translate( xltDF.iloc[i] )
                      # SUCCEEDED
                      break
                  except Exception:
                      # FAILED ( a bare except: would also swallow Ctrl-C )
                      print( "EXC: _blockCRAWLer() on index ", i )
                      time.sleep( ... )    # placeholder: put your own back-off delay [s] here
                      # be careful here, not to get on API-provider's BLACK-LIST
                      continue
    
    • if you need more time-related details per thread, you may reuse this
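The tips above (disjoint blocks per thread, naive retries, per-thread timing) can be put together roughly like this. It is only a sketch under stated assumptions: `translate_stub` is a hypothetical stand-in for the real API call, `block_bounds` and `crawl_block` are illustrative helper names, and the retry count and pause are arbitrary values, not anything the API provider documents. Unlike the `while True` loop above, it gives up after a few attempts.

```python
import threading
import time

def translate_stub(text):
    # Hypothetical stand-in for the real translation call; no network I/O happens here
    time.sleep(0.001)
    return f"EN:{text}"

def block_bounds(n_items, n_threads):
    """Split range(n_items) into n_threads contiguous, non-overlapping (start, end) blocks."""
    step, extra = divmod(n_items, n_threads)
    bounds, start = [], 0
    for k in range(n_threads):
        end = start + step + (1 if k < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds

def crawl_block(src, dst, i_start, i_end, timings, max_retries=3, pause_s=0.01):
    # Each thread touches only its own index range, so the blocks never overlap
    t0 = time.perf_counter_ns()
    for i in range(i_start, i_end):
        for attempt in range(max_retries):
            try:
                dst[i] = translate_stub(src[i])
                break
            except Exception:
                # Linearly growing pause before the next attempt (assumed values)
                time.sleep(pause_s * (attempt + 1))
        # If every attempt failed, dst[i] keeps its initial None
    timings[(i_start, i_end)] = time.perf_counter_ns() - t0

src = ["hola", "bonjour", "hallo", "ciao", "novalue"]
dst = [None] * len(src)
timings = {}

threads = [
    threading.Thread(target=crawl_block, args=(src, dst, a, b, timings))
    for a, b in block_bounds(len(src), 2)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

for block, ns in sorted(timings.items()):
    print(f"block {block}: {ns} [ns]")
```

The sketch writes into a plain preallocated list instead of a pandas object, since pandas does not promise thread safety. With the real column, `src` could be `xltDF.tolist()` and the result wrapped back with `pd.Series( dst )` once all threads have joined.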

    Do not hesitate to go tuning & tweaking - and anyway, keep us posted how fast you managed to get, that's fair, isn't it?