pyspark, azure-data-lake-gen2

Output file name to final folder in pyspark


I want to write the data to the output folder without Spark's standard file naming (the `part-0000-...` files):


Is there any way to write the output with only a specific file name and extension (.json)?

Thanks in advance for any help!


Solution

  • No, there isn’t. Bringing everything into one partition and then writing that as a single file is not Spark’s intended use case. To keep the behaviour consistent, writing a dataset always creates a folder, regardless of the DataFrame’s number of partitions, with each file in that folder corresponding to one partition.

    However, if you know the driver can hold the partition, then you could use standard Python functionality:

    import json

    # Collect all rows to the driver and convert each Row to a plain dict.
    data = [row.asDict() for row in dataframe.collect()]

    # Write the collected data as a single JSON file with exactly the name you want.
    with open("name_of_file.json", "w") as fh:
        json.dump(obj=data, fp=fh)
    

    Note that this writes a single JSON array rather than the JSON Lines format Spark produces, but there are ways around that too.
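
    As a sketch of one such workaround: instead of dumping the whole list as one array, write one `json.dumps` call per row, which yields JSON Lines output. The sample `data` below is a hypothetical stand-in for the list of dicts collected from the DataFrame as shown above.

    ```python
    import json

    # Hypothetical sample; in practice this would come from
    # [row.asDict() for row in dataframe.collect()].
    data = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]

    # JSON Lines: one JSON object per line, no enclosing array.
    with open("name_of_file.json", "w") as fh:
        for row in data:
            fh.write(json.dumps(row) + "\n")
    ```

    The resulting file matches the line-delimited format that `spark.read.json` expects by default, so it can be read back without extra options.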