I am working in Databricks and have a PySpark DataFrame that I am converting to pandas, and then to a JSON lines file, which I want to upload to an Azure container (ADLS Gen2). The file is large, so I want to compress it before uploading.
I first convert the PySpark DataFrame to pandas:
pandas_df = df.select("*").toPandas()
Then I convert it to newline-delimited JSON:
json_lines_data = pandas_df.to_json(orient='records', lines=True)
Then I write it to blob storage with the following function:
from azure.storage.blob import BlobServiceClient
from azure.core.exceptions import ResourceNotFoundError

def upload_blob(json_lines_data, connection_string, container_name, blob_name):
    blob_service_client = BlobServiceClient.from_connection_string(connection_string)
    blob_client = blob_service_client.get_blob_client(container=container_name, blob=blob_name)
    try:
        blob_client.get_blob_properties()
        blob_client.delete_blob()
    # if the blob does not exist yet, no delete is necessary
    except ResourceNotFoundError:
        pass
    blob_client.upload_blob(json_lines_data)
This works fine, but the data is around 3 GB per file and takes a long time to download, so I would rather compress the files. Can anyone here help with how to compress the JSON lines file and upload it to the Azure container? I have tried a lot of different things, and nothing is working.
If there is a better way to do this in Databricks, I can change my approach. I did not write the output with Databricks/Spark directly because I need to produce a single file and control the filename.
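To illustrate what I am after (a sketch with dummy data, not my real DataFrame): pandas can already produce a single gzipped JSON-lines file locally in one step, so the part I am stuck on is getting a compressed file like this into the container.

```python
import gzip
import os
import tempfile

import pandas as pd

# Dummy data standing in for the real toPandas() result
pandas_df = pd.DataFrame([{"id": 1, "name": "a"}, {"id": 2, "name": "b"}])

# pandas can gzip the JSON-lines output in a single step when given a file path
path = os.path.join(tempfile.gettempdir(), "output.json.gz")
pandas_df.to_json(path, orient="records", lines=True, compression="gzip")

# The file round-trips as plain newline-delimited JSON
with gzip.open(path, "rt") as f:
    print(f.read())
```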
There is a way to compress the JSON file before uploading it to blob storage.
The code below converts the data to JSON, encodes it as binary (UTF-8), and finally compresses it with gzip.
I would suggest adding this code before your upload function.
import json
import gzip

def compress_data(data):
    # Convert to JSON (note: for newline-delimited output you would
    # already have a string and can skip this step)
    json_data = json.dumps(data, indent=2)
    # Convert to bytes
    encoded = json_data.encode('utf-8')
    # Compress
    compressed = gzip.compress(encoded)
    return compressed
Reference: https://gist.github.com/LouisAmon/4bd79b8ab80d3851601f3f9016300ac4#file-json_to_gzip-py
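In your case `json_lines_data` is already a newline-delimited JSON string, so a minimal sketch (assuming the `json_lines_data` variable from your question) is to encode and gzip it directly, then pass the resulting bytes to your existing upload function under a `.gz` blob name:

```python
import gzip

def compress_json_lines(json_lines_data: str) -> bytes:
    # Gzip a newline-delimited JSON string into bytes ready for upload
    return gzip.compress(json_lines_data.encode('utf-8'))

# Round trip with a small JSON-lines string
sample = '{"id": 1}\n{"id": 2}\n'
compressed = compress_json_lines(sample)
assert gzip.decompress(compressed).decode('utf-8') == sample

# The compressed bytes can then go through the unchanged upload function,
# e.g. upload_blob(compressed, connection_string, container_name, "data.json.gz")
```

`upload_blob` needs no changes because `BlobClient.upload_blob` accepts bytes as well as strings; using a `.gz` blob name just makes the compression visible to downstream consumers.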