python-3.x, google-cloud-storage

GCS Python lib: blob.download_as_bytes() vs blob.upload_from_string()


I'm trying to implement a pretty simple flow: download a gzipped file from GCS, process it, and upload it back.

I had no issues with the first part, since blob.download_as_bytes() not only downloads the file but also inflates it on the fly, so the method returns the serialized object, which I only need to deserialize.

My problem is with blob.upload_from_string. I have written the following code:

import gzip

import orjson
from google.cloud import storage

def download_object(blob: storage.Blob):
    bytes_buffer = blob.download_as_bytes()
    dict_data = orjson.loads(bytes_buffer)
    return load_from_dict(dict_data)

def upload_object(blob: storage.Blob, obj):
    dict_data = obj.save_as_dict()
    json_string = orjson.dumps(dict_data).decode("utf-8")
    compressed = gzip.compress(json_string.encode("utf-8"))
    blob.upload_from_string(compressed)

With this code I'm able to download and re-upload a file that was previously gzipped on upload via the gsutil CLI with the following command:

gsutil cp -Z ~/data.json gs://bucket/

But when I try to repeat the flow for the resulting file (which is present and, judging by its size, seems to be gzipped), I get an error in download_object on the line

dict_data = orjson.loads(bytes_buffer)

orjson.JSONDecodeError: str is not valid UTF-8: surrogates not allowed: line 1 column 1 (char 0)
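
One way to narrow this down is to check whether the bytes returned by the download still begin with the gzip magic number, which would mean they were not inflated on the way out. The following is only a rough diagnostic sketch: looks_gzipped and loads_maybe_gzipped are hypothetical helper names, and the standard-library json module stands in for orjson so the snippet runs without extra dependencies.

```python
import gzip
import json

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)


def looks_gzipped(data: bytes) -> bool:
    """Return True if the payload starts with the gzip magic number."""
    return data[:2] == GZIP_MAGIC


def loads_maybe_gzipped(data: bytes):
    """Parse JSON from bytes, inflating first if the bytes are still gzipped."""
    if looks_gzipped(data):
        data = gzip.decompress(data)
    return json.loads(data)


# Simulate a payload that arrives still compressed.
payload = gzip.compress(json.dumps({"answer": 42}).encode("utf-8"))
print(looks_gzipped(payload))        # True
print(loads_maybe_gzipped(payload))  # {'answer': 42}
```

If looks_gzipped returns True for the buffer coming out of blob.download_as_bytes(), the JSON parser is being handed compressed data, which would explain a decode error at the very first character.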

What should upload_object look like to do the same thing as the gsutil command?


Solution

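  • Thanks to @John Hanley for his question regarding metadata; this is exactly what I was missing. gsutil cp -Z stores the object with Content-Encoding: gzip, and that metadata is what lets the download be inflated on the fly. upload_from_string does not set it on its own, so the re-uploaded file came back still compressed. The proper solution would be: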

    import gzip

    import orjson
    from google.cloud import storage

    def download_object(blob: storage.Blob):
        bytes_buffer = blob.download_as_bytes()
        dict_data = orjson.loads(bytes_buffer)
        return load_from_dict(dict_data)

    def upload_object(blob: storage.Blob, obj):
        dict_data = obj.save_as_dict()
        json_bytes = orjson.dumps(dict_data)  # orjson.dumps already returns bytes
        compressed = gzip.compress(json_bytes)
        blob.content_encoding = "gzip" # <- This line made all the difference
        blob.upload_from_string(compressed, content_type="application/json")
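
Since the missing piece was object metadata, it can help to confirm what GCS actually stored after an upload. The following is a minimal sketch and not part of the original answer: describe_encoding is a hypothetical helper, and blob is assumed to be a google.cloud.storage.Blob (the function only relies on its reload() method and its name, content_encoding, and content_type properties).

```python
def describe_encoding(blob) -> str:
    """Summarize the stored encoding metadata of a GCS object.

    `blob` is assumed to be a google.cloud.storage.Blob.
    """
    blob.reload()  # fetch the object's current metadata from the server
    return (
        f"{blob.name}: content_encoding={blob.content_encoding!r}, "
        f"content_type={blob.content_type!r}"
    )
```

For an object uploaded with gsutil cp -Z or with the corrected upload_object, this should report content_encoding='gzip'; for an object uploaded without that metadata it should report None.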