python dataset translation google-translate data-augmentation

Python Google Translate API error: How to translate a large amount of data


My problem

I would like to use a data-augmentation method for NLP that consists of back-translating a dataset.

Basically, I have a large dataset (SNLI) consisting of 1,100,000 English sentences. What I need to do is translate these sentences into another language, and then translate them back to English.
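To make the round trip concrete, here is a minimal sketch of back-translation. The `translate(text, src, dest)` callable is a hypothetical stand-in for whichever backend ends up working, and `fake_translate` exists only so the example runs offline; neither is part of any real library.

```python
from typing import Callable, List

# Hypothetical backend signature: translate(text, src, dest) -> translated text.
TranslateFn = Callable[[str, str, str], str]


def back_translate(sentences: List[str], pivot: str, translate: TranslateFn) -> List[str]:
    """Translate each English sentence into `pivot`, then back into English."""
    augmented = []
    for sentence in sentences:
        forward = translate(sentence, "en", pivot)         # en -> pivot
        augmented.append(translate(forward, pivot, "en"))  # pivot -> en
    return augmented


def fake_translate(text: str, src: str, dest: str) -> str:
    # Offline stand-in: it only tags the text with the direction it "translated".
    return f"[{src}->{dest}] {text}"
```

Each sentence costs two translation calls per pivot language, so the request count is twice the dataset size for every language used.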

I may have to do this for several languages, so I have a lot of translations to do.

I need a free solution.


What I did so far

I tried several Python modules for translation, but due to recent changes in the Google Translate API, most of them do not work. googletrans seems to work if we apply this solution.

However, it does not work for a big dataset. Google imposes a limit of 15K characters (as pointed out by this, this and this). The first link shows a supposed workaround.
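If the 15K limit applies per request (an assumption; the linked reports are not specific here), one way to stay under it is to group sentences into batches by character count. The helper name `chunk_by_chars` and the one-character separator cost are illustrative choices, not part of googletrans.

```python
from typing import Iterable, Iterator, List


def chunk_by_chars(sentences: Iterable[str], max_chars: int = 15_000) -> Iterator[List[str]]:
    """Yield lists of sentences whose combined length stays within max_chars."""
    batch: List[str] = []
    used = 0
    for sentence in sentences:
        cost = len(sentence) + 1  # +1 for a separator such as a newline
        if batch and used + cost > max_chars:
            yield batch
            batch, used = [], 0
        batch.append(sentence)
        used += cost
    if batch:
        yield batch
```

A single sentence longer than `max_chars` is still emitted alone in its own batch, so it would need separate handling.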


Where I am blocked

Even if I apply the workaround (initializing the Translator at every iteration), it does not work, and I get the following error:

json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

I tried using proxies and other Google Translate URLs:

from googletrans import Translator

URLS = ['translate.google.com', 'translate.google.co.kr', 'translate.google.ac', 'translate.google.ad', 'translate.google.ae', ...]
proxies = {'http': '1.243.64.63:48730', 'https': '59.11.98.253:42645'}

t = Translator(service_urls=URLS, proxies=proxies)

But it does not change anything.


Note

My problem might come from the fact that I am using multi-threading: 100 workers translate the whole dataset. If they work in parallel, maybe together they use more than 15K characters.

But I need multi-threading. Without it, translating the whole dataset would take several weeks...
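One way to keep several threads while bounding their combined request rate is a limiter shared by all workers, plus a retry when the response cannot be decoded. This is only a sketch under assumptions: `translate` is a hypothetical one-argument callable, the interval and backoff values are placeholders, and nothing here is confirmed to avoid the block on Google's side.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class RateLimiter:
    """Space calls at least `min_interval` seconds apart across all threads."""

    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_slot = time.monotonic()

    def wait(self) -> None:
        # Reserve the next free time slot under the lock, then sleep outside it.
        with self._lock:
            now = time.monotonic()
            slot = max(now, self._next_slot)
            self._next_slot = slot + self.min_interval
        time.sleep(max(0.0, slot - now))


def translate_all(sentences, translate, workers=8, min_interval=0.5,
                  retries=3, backoff=1.0):
    """Translate sentences in parallel; a sentence that keeps failing maps to None."""
    limiter = RateLimiter(min_interval)

    def work(sentence):
        for attempt in range(retries):
            limiter.wait()
            try:
                return translate(sentence)
            except ValueError:  # json.JSONDecodeError is a subclass of ValueError
                time.sleep(backoff * 2 ** attempt)  # exponential backoff
        return None

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input sentences.
        return list(pool.map(work, sentences))
```

With this design the worker count no longer sets the request rate; `min_interval` does, so 100 workers are not needed to saturate whatever rate the service tolerates.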


My question

How do I fix this error so I can translate all the sentences?

If that is not possible, is there a free alternative for machine translation in Python (it does not have to be Google Translate) that can handle such a big dataset?


Solution

  • 1,100,000 sentences is a lot of text to translate.

    Currently, Google Cloud Translation v3 offers a free tier that you may want to use (the first 500,000 characters each month are free). Since that does not seem to be enough for your use case, you would probably need to create more than one billing account, or wait a month between batches, to translate more text.

    See this link to learn how to perform a text translation with Python.
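To put the quota in perspective and show the rough shape of a v3 call, here is a hedged sketch. The average sentence length passed to `months_on_free_tier` is a made-up figure, not measured on SNLI, and the estimate assumes the pivot-language text is about as long as the English. `translate_batch` assumes the `google-cloud-translate` package is installed and credentials are configured; `"my-project"` is a placeholder project ID, and the `client` parameter is an illustrative hook for passing in a ready-made client.

```python
import math

FREE_CHARS_PER_MONTH = 500_000  # free-tier figure quoted in the answer


def months_on_free_tier(num_sentences, avg_chars, passes=2):
    """Months needed on the free quota alone; passes=2 covers the round trip."""
    total_chars = num_sentences * avg_chars * passes
    return math.ceil(total_chars / FREE_CHARS_PER_MONTH)


def translate_batch(texts, target, source="en", project_id="my-project", client=None):
    """Translate a list of strings with Cloud Translation v3."""
    if client is None:
        # Imported lazily so the estimate above works without the package.
        from google.cloud import translate_v3
        client = translate_v3.TranslationServiceClient()
    response = client.translate_text(
        request={
            "parent": f"projects/{project_id}/locations/global",
            "contents": list(texts),
            "mime_type": "text/plain",
            "source_language_code": source,
            "target_language_code": target,
        }
    )
    return [t.translated_text for t in response.translations]


# Assumed average of 50 characters per sentence (hypothetical).
print(months_on_free_tier(1_100_000, 50))  # 220
```

Under that assumed sentence length, one pivot language alone would take well over a decade on a single free quota, which is why the free tier by itself does not cover this use case.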