Tags: python, pyspark, azure-storage, gzip, azure-databricks

How to compress a JSON lines file and upload it to an Azure container?


I am working in Databricks and have a PySpark dataframe that I convert to pandas, then to a JSON lines file, which I want to upload to an Azure container (ADLS Gen2). The file is large, so I want to compress it before uploading.

I am first converting the PySpark dataframe to pandas:

pandas_df = df.select("*").toPandas()

Then converting it to newline-delimited JSON:

json_lines_data = pandas_df.to_json(orient='records', lines=True)

Then writing it to blob storage with the following function:

from azure.storage.blob import BlobServiceClient
from azure.core.exceptions import ResourceNotFoundError

def upload_blob(json_lines_data, connection_string, container_name, blob_name):
  blob_service_client = BlobServiceClient.from_connection_string(connection_string)
  blob_client = blob_service_client.get_blob_client(container=container_name, blob=blob_name)
  try:
    # Delete any existing blob with the same name before uploading
    blob_client.get_blob_properties()
    blob_client.delete_blob()
  except ResourceNotFoundError:
    # Blob does not exist yet, so there is nothing to delete
    pass
  blob_client.upload_blob(json_lines_data)
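
For context, I call it like this (the container and blob names here are placeholders):

upload_blob(json_lines_data, connection_string, 'my-container', 'output.jsonl')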

This works fine, but the data is around 3 GB per file and takes a long time to download, so I would rather compress the files. Can anyone help with how to compress the JSON lines file and upload it to the Azure container? I have tried a lot of different things, and nothing has worked.

If there is a better way to do this in Databricks, I can change my approach. I did not write the file using Databricks' built-in writers because I need to output a single file and control the filename.


Solution

  • You can compress the JSON data with gzip before uploading it to blob storage.

    The code below converts the data to a JSON string, encodes it to bytes (UTF-8), and compresses it with gzip.

    I would suggest adding this step before your upload function.

    import json
    import gzip

    def compress_data(data):
        # Convert to a JSON string
        json_data = json.dumps(data, indent=2)
        # Encode to bytes (UTF-8)
        encoded = json_data.encode('utf-8')
        # Compress with gzip
        compressed = gzip.compress(encoded)
        return compressed
    

    Reference: https://gist.github.com/LouisAmon/4bd79b8ab80d3851601f3f9016300ac4#file-json_to_gzip-py
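
  • Since you already have newline-delimited JSON as a string from to_json(..., lines=True), you can skip the json.dumps step and compress that string directly. A minimal end-to-end sketch (the .json.gz blob name is just a suggested convention, not required by the SDK):

    import gzip

    # json_lines_data is already a newline-delimited JSON string,
    # so encode it to bytes and gzip-compress it directly
    compressed = gzip.compress(json_lines_data.encode('utf-8'))

    # Upload the compressed bytes with your existing function; the
    # .json.gz extension signals to consumers that the payload is gzipped
    upload_blob(compressed, connection_string, container_name, 'output.json.gz')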