I am working in Databricks and have a PySpark DataFrame that I am converting to pandas, and then to a JSON lines file, which I want to upload to an Azure container (ADLS Gen2). The file is large, so I want to compress it before uploading.
I first convert the PySpark DataFrame to pandas:
pandas_df = df.select("*").toPandas()
Then I convert it to newline-delimited JSON:
json_lines_data = pandas_df.to_json(orient='records', lines=True)
Then I write it to blob storage with the following function:
from azure.storage.blob import BlobServiceClient
from azure.core.exceptions import ResourceNotFoundError

def upload_blob(json_lines_data, connection_string, container_name, blob_name):
    blob_service_client = BlobServiceClient.from_connection_string(connection_string)
    blob_client = blob_service_client.get_blob_client(container=container_name, blob=blob_name)
    try:
        blob_client.get_blob_properties()
        blob_client.delete_blob()
    # if the blob does not exist yet, no delete is necessary
    except ResourceNotFoundError:
        pass
    blob_client.upload_blob(json_lines_data)
This works fine, but the data is around 3 GB per file and takes a long time to download, so I would rather compress the files. Can anyone here help with how to compress the JSON lines file and upload it to the Azure container? I have tried a lot of different things, and nothing is working.
If there is a better way to do this in Databricks, I can change my approach. I did not write the output with Databricks/Spark directly because I need to produce a single file and control the filename.
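To illustrate what I am after (a sketch with dummy data, not my real DataFrame): pandas can already produce a single gzipped JSON-lines file locally in one step, so the part I am stuck on is getting a compressed file like this into the container.

```python
import gzip
import os
import tempfile

import pandas as pd

# Dummy data standing in for the real toPandas() result
pandas_df = pd.DataFrame([{"id": 1, "name": "a"}, {"id": 2, "name": "b"}])

# pandas can gzip the JSON-lines output in a single step when given a file path
path = os.path.join(tempfile.gettempdir(), "output.json.gz")
pandas_df.to_json(path, orient="records", lines=True, compression="gzip")

# The file round-trips as plain newline-delimited JSON
with gzip.open(path, "rt") as f:
    print(f.read())
```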
There is a way to compress the JSON file before uploading it to blob storage.
The code below converts the data to JSON, encodes it as binary (UTF-8), and finally compresses it with gzip.
I would suggest adding this code before your upload function.
import json
import gzip

def compress_data(data):
    # Convert to JSON (note: for newline-delimited output you would
    # already have a string and can skip this step)
    json_data = json.dumps(data, indent=2)
    # Convert to bytes
    encoded = json_data.encode('utf-8')
    # Compress
    compressed = gzip.compress(encoded)
    return compressed
Reference: https://gist.github.com/LouisAmon/4bd79b8ab80d3851601f3f9016300ac4#file-json_to_gzip-py
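In your case `json_lines_data` is already a newline-delimited JSON string, so a minimal sketch (assuming the `json_lines_data` variable from your question) is to encode and gzip it directly, then pass the resulting bytes to your existing upload function under a `.gz` blob name:

```python
import gzip

def compress_json_lines(json_lines_data: str) -> bytes:
    # Gzip a newline-delimited JSON string into bytes ready for upload
    return gzip.compress(json_lines_data.encode('utf-8'))

# Round trip with a small JSON-lines string
sample = '{"id": 1}\n{"id": 2}\n'
compressed = compress_json_lines(sample)
assert gzip.decompress(compressed).decode('utf-8') == sample

# The compressed bytes can then go through the unchanged upload function,
# e.g. upload_blob(compressed, connection_string, container_name, "data.json.gz")
```

`upload_blob` needs no changes because `BlobClient.upload_blob` accepts bytes as well as strings; using a `.gz` blob name just makes the compression visible to downstream consumers.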