I have around 613 text files stored in Azure Data Lake Gen2 at a path like '/rawdata/no=*/*.txt'. The files are base64 encoded, and I want to read all of them and decode each one. But when I run my code, it reads the 600-odd text files yet produces only 20 output files. Why am I getting only 20 output files?
This is my code:
import base64
from datetime import datetime

# Function to decode base64 encoded data
def decode_base64(line):
    decoded_bytes = base64.b64decode(line)
    return decoded_bytes

# Read every text file under the partitioned path
input_data = spark.read.text('/rawdata/no=*/*.txt')

# Apply the decoding function to each line
decoded_data = input_data.rdd.map(lambda row: decode_base64(row.value))

# Generate a timestamp
timestamp = datetime.now().strftime("%Y%m%d%H%M")

# Write the decoded data to an output folder suffixed with the timestamp
output_folder_path = f"{output_file_name}_ss3_{timestamp}/"
decoded_data.saveAsPickleFile(output_folder_path)
What am I expecting? One decoded output file per input text file, i.e. around 613 output files.
That is because of the number of partitions your RDD has: Spark writes one part file per partition, so the output file count follows the partition count, not the input file count.
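You can confirm this before saving by checking the RDD's partition count; a minimal sketch, reusing the input_data from your question:

    # Spark writes one part file per partition, so if this prints 20,
    # saving will produce 20 output files.
    print(input_data.rdd.getNumPartitions())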
Use the code below when saving.
output_folder_path = f"{output_file_name}_ss3/"
# Repartition to one partition per row so each decoded line lands in its
# own output file; count() returns the same number as len(collect()) but
# avoids pulling every row back to the driver.
decoded_data.repartition(decoded_data.count()).saveAsPickleFile(output_folder_path)
Output:
In my case, 5 text files were read and the decoded data was written to 5 files.
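As a side note, if you want plain-text output rather than pickle files, Spark's built-in unbase64 function can do the decoding inside the DataFrame API. A sketch, assuming each line is valid base64 and the decoded bytes are UTF-8 text:

    from pyspark.sql.functions import unbase64

    # Decode the base64 column, cast the resulting binary to a UTF-8
    # string, and write plain-text part files instead of pickle files.
    decoded_df = input_data.select(unbase64(input_data.value).cast("string").alias("value"))
    decoded_df.write.text(output_folder_path)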