I'm trying to implement a pretty simple flow: download a gzipped file from GCS, process it, and upload it back.
I had no issues with the first part, since blob.download_as_bytes() not only downloads the file but also inflates it on the fly, so the result is the serialized object, which I only need to deserialize.
My problem is with blob.upload_from_string. I have written the following code:
import gzip

import orjson
from google.cloud import storage

def download_object(blob: storage.Blob):
    # For the gsutil-uploaded file this already returns inflated JSON bytes
    bytes_buffer = blob.download_as_bytes()
    dict_data = orjson.loads(bytes_buffer)
    return load_from_dict(dict_data)

def upload_object(blob: storage.Blob, obj):
    dict_data = obj.save_as_dict()
    json_string = orjson.dumps(dict_data).decode("utf-8")
    compressed = gzip.compress(json_string.encode("utf-8"))
    blob.upload_from_string(compressed)
With this code I'm able to download and re-upload a file that was previously gzipped on upload via the gsutil CLI with the following command:
gsutil cp -Z ~/data.json gs://bucket/
But when I try to repeat the flow for the resulting file (which is present and, judging by its size, seems to be gzipped), I get an error in download_object on the line
dict_data = orjson.loads(bytes_buffer)
orjson.JSONDecodeError: str is not valid UTF-8: surrogates not allowed: line 1 column 1 (char 0)
What should upload_object look like to do the same as the gsutil command does?
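Thanks to @John Hanley for his question regarding metadata, this is exactly what I was missing. gsutil cp -Z stores the object with Content-Encoding: gzip, which is why the download comes back already inflated. My upload did not set that metadata, so the next download returned the raw gzip bytes and orjson could not parse them. The proper solution would be: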
import gzip

import orjson
from google.cloud import storage

def download_object(blob: storage.Blob):
    bytes_buffer = blob.download_as_bytes()
    dict_data = orjson.loads(bytes_buffer)
    return load_from_dict(dict_data)

def upload_object(blob: storage.Blob, obj):
    dict_data = obj.save_as_dict()
    json_bytes = orjson.dumps(dict_data)  # orjson.dumps already returns bytes
    compressed = gzip.compress(json_bytes)
    blob.content_encoding = "gzip"  # <- This line made all the difference
    blob.upload_from_string(compressed, content_type="application/json")
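
To see why the missing metadata produced that particular error, here is a small local sketch that does not touch GCS at all. It uses the standard-library json module in place of orjson so it runs anywhere, and the helper names encode_payload and decode_payload are made up for illustration. It shows that gzip output is not valid UTF-8, and that a reader could in principle detect the gzip magic bytes and inflate manually if an object was uploaded without Content-Encoding.

```python
import gzip
import json

def encode_payload(data: dict) -> bytes:
    # Same shape as upload_object: serialize to JSON bytes, then gzip them
    return gzip.compress(json.dumps(data).encode("utf-8"))

def decode_payload(raw: bytes) -> dict:
    # Hypothetical fallback: a gzip stream always starts with 0x1f 0x8b,
    # which can never be the start of a JSON document
    if raw[:2] == b"\x1f\x8b":
        raw = gzip.decompress(raw)
    return json.loads(raw)

payload = encode_payload({"answer": 42})

# Raw gzip bytes are not valid UTF-8, which is what the JSON parser tripped over
try:
    payload.decode("utf-8")
    decoded_directly = True
except UnicodeDecodeError:
    decoded_directly = False

restored = decode_payload(payload)
```

With blob.content_encoding = "gzip" set on upload, this fallback should not be needed, since download_as_bytes() then hands back the inflated JSON as it did for the gsutil-uploaded file.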