Tags: python-3.x, encoding, utf-8, google-api, google-translation-api

Google Translate API returns non UTF8 characters


Resolved: see the end of this post for the solution.

Good evening.

I'm trying out the Google Translate v3 API.

And I have run into a mysterious encoding issue.

Here is what I do:

def translate_text_langueTarget(texteToTranslate, langueTarget):
    parent = client.location_path(project_id, location)
    langueOrigin = detect_language(texteToTranslate)
    # Nothing to translate when source and target are both English.
    if langueOrigin == "en" and langueTarget == "en":
        return texteToTranslate
    try:
        response = client.translate_text(
            parent=parent,
            contents=[texteToTranslate],
            mime_type='text/plain',
            source_language_code=langueOrigin,
            target_language_code=langueTarget)
        # Debug output, kept inside the try so `response` is always defined here.
        print(response)
        print(type(response))
        print(response.translations)
        print(type(response.translations))
        # Slices the text out of the string form of the translations list.
        translatedTexte = str(response.translations)[19:-3]
    except Exception:
        translatedTexte = "Sorry my friend, the translation is lost on the internet"

    return translatedTexte

I call it with:

stringToTrad = "prefer"
langTarget = "fr"
translateString = translate_text_langueTarget(stringToTrad, langTarget)

I expect to get "préféré" as the answer.

But I get: "pr\303\251f\303\251rer"

I tried to track down this error by adding some debug output to my code:

print(response)
print(type(response))
print(response.translations)
print(type(response.translations))

I think it is an encoding problem, but I can't find an answer.

I work in Python, and my script is tagged with:

#! /usr/bin/env python3
# coding: utf-8

in the header.

Do you have any idea?

Resolved. I use:

import codecs
# Turn the \303\251-style escapes into raw bytes, then decode those as UTF-8.
translatedTexte = codecs.escape_decode(translatedTexte)[0]
translatedTexte = translatedTexte.decode("utf8")
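
The fix above repairs the string after the fact. The escapes most likely appear because the text is sliced out of `str(response.translations)`: the string form of a protobuf message writes non-ASCII bytes as octal escapes by default. A minimal sketch of an alternative, assuming each result exposes its text in a `translated_text` field as in the v3 `Translation` message; `extract_translation` and the stand-in response built with `SimpleNamespace` are hypothetical, used only so the example runs without credentials:

```python
from types import SimpleNamespace


def extract_translation(response):
    """Return the first translation's text, or None if there is none."""
    translations = list(response.translations)
    if not translations:
        return None
    # Read the field directly instead of slicing str(response.translations).
    return translations[0].translated_text


# Hypothetical stand-in for a real API response, for illustration only.
fake_response = SimpleNamespace(
    translations=[SimpleNamespace(translated_text="préférer")]
)
print(extract_translation(fake_response))
```

Read this way, the value is an ordinary Python str, so no escape decoding should be needed.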

Solution

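  • The Google Translate API gives you UTF-8 text. You got the bytes c3 a9 (303 251 in octal), which is exactly é, as expected.

    So your code takes correct UTF-8 data and probably writes it out with the wrong encoding.

    This line does not help here; it only declares the encoding of the source file itself, which is already UTF-8 by default in Python 3, and does not affect input or output: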

    # coding: utf-8
    

    If you want your code to treat input and output as UTF-8, you have to say so explicitly. I assume one problem with your code is that you use print (writing to a file is better). On Windows, terminals are not UTF-8 by default; they use a legacy code page such as the old "Windows ANSI" encoding, also known as Windows-1252.

    So write to a file (with an explicit UTF-8 encoding), or change your terminal settings to get a UTF-8 terminal. In addition, the result may contain escape sequences. Results written as octal escapes look suspicious to me: that is not standard Python behaviour (Python would instead complain about a wrong encoding). You may need to parse the response to convert the escape sequences.
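
The points above can be sketched as follows. This is a minimal illustration rather than the asker's code: `decode_octal_escapes` and the file name `translation.txt` are hypothetical, and `codecs.escape_decode` is an undocumented CPython helper, so relying on it is an assumption.

```python
import codecs
import os
import tempfile


def decode_octal_escapes(escaped):
    """Turn a string like 'pr\\303\\251f' into real text (hypothetical helper)."""
    # escape_decode interprets backslash escapes and returns (bytes, length).
    raw_bytes = codecs.escape_decode(escaped.encode("utf-8"))[0]
    return raw_bytes.decode("utf-8")


# Octal 303 and 251 are the bytes 0xC3 0xA9, the UTF-8 encoding of "é".
assert bytes([0o303, 0o251]) == b"\xc3\xa9"
assert b"\xc3\xa9".decode("utf-8") == "é"

text = decode_octal_escapes(r"pr\303\251f\303\251rer")

# Write with an explicit encoding instead of relying on the terminal's.
path = os.path.join(tempfile.gettempdir(), "translation.txt")
with open(path, "w", encoding="utf-8") as handle:
    handle.write(text)
```

If you would rather keep printing to the terminal, Python 3.7+ provides `sys.stdout.reconfigure(encoding="utf-8")`, and setting the `PYTHONIOENCODING=utf-8` environment variable is another option; whether the terminal then displays the characters correctly still depends on its own settings.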