I'm defining a copy activity pipeline in Azure DataFactory. The copy is from a parquet file in azure blobstorage to a table in Azure Synapse Analytics (former ADW).
The sink table in Synapse is always a new one, as it is defined dynamically by a parameter in the pipeline. For example: the sink table name is "[schema].[some_prefix_{pipeline_string_parameter}]
".
Because of that, the sink table doesn't exist, it has to be created. To solve this, I'm using the "auto create table" option.
However, I also need to specify a specific distribution and persistence type. In my case, I want to define a hash distributed, heap table. How can I specify those configurations to the auto create table? As I understand, the default configs are CCI, round-robin, which I don't want.
As you mentioned and as per the Documentation,
Auto create table option in copy activity sink only gives default distribution i,e.
ROUND_ROBIN
.
You can raise a feature request about supporting other distributions here.
You can try the below possible methods as workarounds.
If your sink table schema is same for every time and if you know the schema, then create table before copying the data. Give your desired distributions in the create table query. For executing the query before copy, you can use Script activity before copy activity (or) pre-copy script in the copy activity sink.
Pass your table name to the script from the pipeline parameter using ADF dynamic content or string interpolation.
If your sink table schema is different every time, then after copy activity, create a duplicate table for your table with your desired distribution using CTAS
. After creating duplicate table, drop your original table and rename it as original table. Go through the script provided in this SO answer by @wBob to do that.
Here, you need to use the Script activity after copy activity to execute the script.