Read gzip from Azure StorageStreamDownloader


I would like to read a gzip file that I downloaded from Azure Blob Storage:

myStorageStreamDownloaderObject = (
    blob_service_client
    .get_container_client('myContainer')
    .download_blob('myBlob.json.gzip')
)

(Note that the file contains JSON and has the extension .gzip, not .gz.)

Attempt 1:

import gzip
contents = myStorageStreamDownloaderObject.readall()
result   = gzip.decompress(contents)

This yields:

contents = b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x04\x00\xbd\x9d=\xb3\xe6\xb8q\x85\xff\xcb\xc4\xab)|4>8\x99...
result   = b''

Attempt 2:

from io import BytesIO
import pandas as pd

with BytesIO() as input_blob:
    myStorageStreamDownloaderObject.readinto(input_blob)
    input_blob.seek(0)
    df = pd.read_csv(input_blob, compression='gzip')

This yields:

EmptyDataError: No columns to parse from file

It does work with Spark:

df = (
    spark.read
    .option("compression", "gzip")
    .schema(json_schema)
    .json([f"wasbs://{container_name}@geoexportpreprodanwb.blob.core.windows.net/{blob_name}"])
)

This nicely returns a dataframe with 275 rows. But I am looking for a solution without Spark.


Solution

  • The cause was very basic: I did not realize that myStorageStreamDownloaderObject behaves like a generator, so its contents can be read only once.

    My full code was something like this:

    contents = myStorageStreamDownloaderObject.readall()
    print(contents)
    # This yields: b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x04\x00\xbd\x9d=\xb3\xe6\xb8q\x85\xff\xcb\xc4\xab)|4>8\x99...
    
    contents = myStorageStreamDownloaderObject.readall()
    result   = gzip.decompress(contents)
    print(result)
    # This yields: b''
    

    Because myStorageStreamDownloaderObject behaves like a generator, I should not have called readall() on it twice: the second call had nothing left to return, so there was nothing to decompress. The following does work:

    contents = myStorageStreamDownloaderObject.readall()
    print(contents)
    result = gzip.decompress(contents)
    print(result)
    

    (My apologies for inaccurately summarizing my code in the question.)
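
The read-once behaviour described above can be reproduced without Azure. The following is only a minimal sketch: io.BytesIO is used as a stand-in for the downloader object (an assumption, not the Azure API), and the JSON payload is made up for illustration. It shows that a second read of an exhausted stream returns empty bytes, and that gzip.decompress turns empty bytes into b'', which matches the result in the question.

```python
import gzip
import io
import json

# Hypothetical payload standing in for the blob's gzip-compressed JSON.
payload = gzip.compress(json.dumps({"rows": 275}).encode("utf-8"))

# io.BytesIO is only a stand-in for the downloader object: like the object
# in the question, it hands out its bytes once and then returns b"".
stream = io.BytesIO(payload)

first = stream.read()    # consumes the whole stream
second = stream.read()   # nothing left to read

# Decompress the bytes that were kept, not a fresh read of the stream.
decoded = json.loads(gzip.decompress(first))

# Empty input decompresses to empty output.
empty_result = gzip.decompress(second)

print(decoded)
print(empty_result)
```

With the real downloader, the same idea applies: call readall() once, keep the returned bytes in a variable, and reuse that variable for printing and decompressing.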