azure, azure-data-factory, parquet, partitioning, parquet-dataset

Partitioning a Parquet file in Data Factory


I am working on a project in Data Factory and I need to save information recurrently to the same Parquet file. Every so often there is an update of the information, and I would like it to be added to the Parquet file as a new partition. I have looked for a way to do this in Data Factory but have not found one. Has anyone done something similar with Data Factory? Does Data Factory have an option to partition Parquet files? I can't use Azure Functions or Databricks, only Data Factory.

What I am doing now is generating two files, one with the previously collected information and the other with the new information. I join them through a Copy activity, which creates a new Parquet file with the updated information, and then I delete the two initial files. I do this every time there is a data update. But I would like to know if there is a different process that partitions the Parquet file each time there is an update.
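To illustrate the merge step described above: one way to express it in the Copy activity is the MergeFiles copy behavior on a file-based Parquet sink. This is only a rough sketch with placeholder names and an assumed copy behavior, not my exact configuration:

    "sink": {
        "type": "ParquetSink",
        "storeSettings": {
            "type": "AzureBlobStorageWriteSettings",
            "copyBehavior": "MergeFiles"
        },
        "formatSettings": {
            "type": "ParquetWriteSettings"
        }
    }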


Solution

  • If you want each update to be in its own partition:

On your sink dataset, set the file name to an expression that includes the current time. In this example, the name of each partition file is the timestamp at which it is created:

    @concat(
        formatDateTime(utcnow(),'yyyyMMddHHmmss'),
        '.parquet'
    )
    

The new rows will be written to a new partition file on each run. When you read the whole folder, you get the complete data, as sketched below.

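As a concrete (but hypothetical) illustration, this is roughly how the dynamic file name plugs into a Parquet sink dataset. The linked service, container, and folder names (AzureBlobStorageLS, data, updates) are placeholders, not values from the question:

    {
        "name": "SinkParquetDataset",
        "properties": {
            "type": "Parquet",
            "linkedServiceName": {
                "referenceName": "AzureBlobStorageLS",
                "type": "LinkedServiceReference"
            },
            "typeProperties": {
                "location": {
                    "type": "AzureBlobStorageLocation",
                    "container": "data",
                    "folderPath": "updates",
                    "fileName": {
                        "value": "@concat(formatDateTime(utcnow(),'yyyyMMddHHmmss'),'.parquet')",
                        "type": "Expression"
                    }
                },
                "compressionCodec": "snappy"
            }
        }
    }

To read the whole folder back in a later Copy activity, the Parquet source can use wildcard settings so that every partition file is picked up:

    "source": {
        "type": "ParquetSource",
        "storeSettings": {
            "type": "AzureBlobStorageReadSettings",
            "recursive": true,
            "wildcardFolderPath": "updates",
            "wildcardFileName": "*.parquet"
        }
    }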