Tags: python, pyspark, pickle, rdd, azure-synapse

Save text files as binary format using saveAsPickleFile with pyspark


I have around 613 text files stored in Azure Data Lake Gen2 at a path such as '/rawdata/no=/.txt'. The files are base64 encoded, so I want to read all of them and base64-decode them. When I run my code, it reads the roughly 600 text files but writes only 20 output files. Why am I getting only 20 output files?

This is my code:

import base64
from datetime import datetime

# Function to decode base64 encoded data
def decode_base64(line):
    decoded_bytes = base64.b64decode(line)
    return decoded_bytes

# Read every text file; each line becomes one row with a `value` column
input_data = spark.read.text('/rawdata/no=*/*.txt')

# Apply decoding function to each line
decoded_data = input_data.rdd.map(lambda row: decode_base64(row.value))

# Generate timestamp
timestamp = datetime.now().strftime("%Y%m%d%H%M")

# Write decoded data to an output folder named with the timestamp
# (output_file_name is assumed to be defined earlier)
output_folder_path = f"{output_file_name}_ss3_{timestamp}/"
decoded_data.saveAsPickleFile(output_folder_path)

  1. I read the text files using the Spark text API, decode them with base64, and then save the RDD as a pickle file.
  2. As the image shows, all 613 text files are read, but the output is only 20 files.

What am I expecting?

  1. I want to read all the text files, decode each file separately, and store each one separately.

Solution

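  • saveAsPickleFile writes one output file per RDD partition, so the number of output files follows the partition count, not the number of input files. Your RDD has 20 partitions, which is why you get 20 files.

    Repartition the RDD into as many partitions as it has records before saving: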

    output_folder_path = f"{output_file_name}_ss3/"
    # count() returns the number of records without pulling all the data to the driver
    decoded_data.repartition(decoded_data.count()).saveAsPickleFile(output_folder_path)
    

    Output:

    In my case, 5 text files were read and the decoded data was written to 5 files.
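To see where a number like 20 can come from, here is a rough pure-Python model of how Spark's file-based readers pack many small files into partitions. It is a simplified sketch, not Spark's actual code: it assumes the default values of spark.sql.files.maxPartitionBytes (128 MB) and spark.sql.files.openCostInBytes (4 MB), ignores the splitting of large files, and uses a hypothetical file size of 1 KB and a hypothetical parallelism of 8. The function name plan_partitions is made up for illustration.

```python
MB = 1024 * 1024

def plan_partitions(file_sizes, parallelism,
                    max_partition_bytes=128 * MB, open_cost=4 * MB):
    """Return a list of partitions, each a list of file sizes (simplified model)."""
    # Every file is charged its own size plus a fixed "open cost".
    total_bytes = sum(size + open_cost for size in file_sizes)
    bytes_per_core = total_bytes // parallelism
    max_split_bytes = min(max_partition_bytes, max(open_cost, bytes_per_core))

    partitions, current, current_size = [], [], 0
    # Largest files first; close the partition when the next file would not fit.
    for size in sorted(file_sizes, reverse=True):
        if current and current_size + size > max_split_bytes:
            partitions.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size + open_cost
    if current:
        partitions.append(current)
    return partitions

# Hypothetical input: 613 files of 1 KB each, read with a parallelism of 8.
partitions = plan_partitions([1024] * 613, parallelism=8)
print(len(partitions))     # 20
print(len(partitions[0]))  # 32
```

Under these assumptions the 4 MB open cost dominates, so about 32 small files fit into each 128 MB partition and 613 files end up in 20 partitions. The real cluster settings may differ; calling decoded_data.getNumPartitions() on the actual RDD shows the real count.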

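If the goal is strictly one output file per input file, a different approach may fit better than repartitioning by record count. spark.read.text produces one record per line, and repartition spreads records around without guaranteeing exactly one per partition. The sketch below instead uses SparkContext.wholeTextFiles, which yields one (path, content) record per file, and then places each record in its own partition with partitionBy. The helper names decode_file and save_one_pickle_per_file are hypothetical, and the sketch assumes each file is small enough to be held in memory as a single record.

```python
import base64

def decode_file(path_and_content):
    """Map a (path, text) pair to (path, list of base64-decoded lines)."""
    path, content = path_and_content
    decoded_lines = [base64.b64decode(line)
                     for line in content.splitlines() if line.strip()]
    return path, decoded_lines

def save_one_pickle_per_file(sc, input_glob, output_folder_path):
    """Hypothetical helper: write one pickle part file per input text file."""
    # wholeTextFiles yields one (path, content) record per input file.
    decoded = sc.wholeTextFiles(input_glob).map(decode_file).cache()
    n = decoded.count()
    # Key each record by a unique index 0..n-1 and send key k to partition k.
    indexed = decoded.zipWithIndex().map(lambda pair: (pair[1], pair[0]))
    one_per_partition = indexed.partitionBy(max(n, 1), lambda key: key)
    # values() drops the index; each partition now holds one (path, lines) record.
    one_per_partition.values().saveAsPickleFile(output_folder_path)
```

It could be called as save_one_pickle_per_file(spark.sparkContext, '/rawdata/no=*/*.txt', output_folder_path). The output files are still named by Spark (part-00000 and so on), so the source path is kept inside each record to identify which input file it came from.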