python, google-cloud-platform, character-encoding, google-cloud-storage

String encoding issue when uploading to Cloud Storage using the Python SDK


I am trying to use the Python client to upload a string to a blob.

When I print the string in my console, all is well, but once uploaded to the bucket it is full of strange encoding errors, such as:

I am bitterly disappointed – he should take the right to vote very seriously.

Here is my code:

from datetime import date
from google.cloud import storage

URL_TEXT_BUCKET = 'BUCKET-NAME'

client = storage.Client()
bucket = client.get_bucket(URL_TEXT_BUCKET)

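def store_url_content(text, key):
    # Object name is "<YYYY-MM-DD>/<key>", using today's date as a prefix
    today = str(date.today())
    blob = bucket.blob(today + '/' + key)
    # With no content_type argument, the default content type is used
    blob.upload_from_string(text)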

I have tried setting encoding_type='utf8', but with no luck. The docs do not state the options or best practices.

EDIT: I have also tried to encode my text as utf8, by calling:

text = text.encode('utf8')

This changed what I saw in the viewer: some characters seemed to be replaced with what I believe are byte escapes, and a b was prepended, as in b'some text x\u023 more'. The final result on GCS was the same, though.

EDIT 2: The issue is with the Google Console viewer; downloading the file back displays everything fine.

It would be great if someone with knowledge of the GCP console could explain why the text is not being rendered correctly.


Solution

  • I came across this problem today. As you already wrote, the file is correctly encoded with utf-8 in the bucket, but displays with encoding artifacts.

    What you're seeing is not the "Google Console viewer", but simply your browser displaying the file. The issue is that the default upload content type is text/plain (see documentation), with no charset specified, so the browser has to guess the encoding and can guess wrong. You can force UTF-8 by passing content_type with a charset to the upload_from_string method, like:

    blob.upload_from_string(text, content_type='text/plain; charset=utf-8')
    

    This will make GCS serve the file with the right Content-Type header to your browser.

    Alternatively, Firefox has "View > Repair Text Encoding" that you can use.
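
To illustrate why the stored bytes can be fine while the browser still shows artifacts, here is a minimal sketch. It assumes the browser falls back to a legacy encoding such as Windows-1252 when no charset is given; the actual fallback depends on the browser and locale. The names SAMPLE, simulate_wrong_charset and upload_text are hypothetical helpers for illustration, not part of the google-cloud-storage library.

```python
# Illustrative sketch only: helper names and the sample string are assumptions.
SAMPLE = "disappointed \u2013 he"  # contains an en dash (U+2013)


def simulate_wrong_charset(text, assumed="cp1252"):
    """Encode as UTF-8 (what ends up in the bucket), then decode the bytes
    with a different charset, as a browser might without a charset hint."""
    return text.encode("utf-8").decode(assumed)


# The en dash is three bytes in UTF-8 (E2 80 93); under cp1252 each byte
# turns into its own character, which is the kind of artifact described above.
garbled = simulate_wrong_charset(SAMPLE)

# The bytes themselves were never damaged: reversing the wrong decode
# recovers the original text, just as downloading the file does.
repaired = garbled.encode("cp1252").decode("utf-8")


def upload_text(blob, text):
    """Upload text through a Blob-like object, declaring the charset so the
    served Content-Type header tells the browser how to decode it."""
    blob.upload_from_string(text, content_type="text/plain; charset=utf-8")
```

In the question's code, store_url_content could call upload_text(blob, text) instead of blob.upload_from_string(text). Objects uploaded earlier keep their old content type until it is changed or the object is uploaded again.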