I would like to read a gzipped JSON file that I have downloaded from Azure Blob Storage:
myStorageStreamDownloaderObject = (
    blob_service_client
    .get_container_client('myContainer')
    .download_blob('myBlob.json.gzip')
)
(Note that the file contains JSON and has the extension .gzip, not .gz.)
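For completeness, blob_service_client is an azure.storage.blob.BlobServiceClient; a minimal sketch of its construction, with a placeholder connection string:

from azure.storage.blob import BlobServiceClient

# Placeholder; substitute your storage account's actual connection string.
connection_string = 'DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...'
blob_service_client = BlobServiceClient.from_connection_string(connection_string)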
Attempt 1:
import gzip
contents = myStorageStreamDownloaderObject.readall()
result = gzip.decompress(contents)
This yields:
contents = b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x04\x00\xbd\x9d=\xb3\xe6\xb8q\x85\xff\xcb\xc4\xab)|4>8\x99...
result = b''
Attempt 2:
from io import BytesIO
import pandas as pd

with BytesIO() as input_blob:
    myStorageStreamDownloaderObject.readinto(input_blob)
    input_blob.seek(0)
    df = pd.read_csv(input_blob, compression='gzip')
This yields:
EmptyDataError: No columns to parse from file
It does work with Spark:
df = (spark.read
      .option("compression", "gzip")
      .schema(json_schema)
      .json([f"wasbs://{container_name}@geoexportpreprodanwb.blob.core.windows.net/{blob_name}"]))
This nicely returns a dataframe with 275 rows. But I am looking for a solution without Spark.
The problem had a very basic cause: I did not realize that myStorageStreamDownloaderObject behaves like a generator, so its contents can only be read once.
My full code was something like this:
contents = myStorageStreamDownloaderObject.readall()
print(contents)
# This yields: b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x04\x00\xbd\x9d=\xb3\xe6\xb8q\x85\xff\xcb\xc4\xab)|4>8\x99...
contents = myStorageStreamDownloaderObject.readall()
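# The stream is now exhausted, so this second readall() returned b''.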
result = gzip.decompress(contents)
print(result)
# This yields: b''
But because myStorageStreamDownloaderObject behaves like a generator, I should not have called readall() twice. The following does work:
contents = myStorageStreamDownloaderObject.readall()
print(contents)
result = gzip.decompress(contents)
print(result)
(My apologies for inaccurately summarizing my code in the question.)
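Putting it all together, here is a minimal sketch of the non-Spark route from blob to DataFrame: read the stream once, decompress, and parse with pandas. The lines=True flag is an assumption based on Spark's .json() reader (which expects newline-delimited JSON by default) having returned 275 rows:

import gzip
import pandas as pd
from io import BytesIO

# Read the stream exactly once; a second readall() would return b''.
contents = myStorageStreamDownloaderObject.readall()
result = gzip.decompress(contents)

# Assumption: the blob holds newline-delimited JSON, one record per line.
# For a single JSON document, drop lines=True.
df = pd.read_json(BytesIO(result), lines=True)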