Tags: azure, azure-data-factory

Azure Data Factory - Unable to preview data


I have an ADF pipeline that uses basic transformations in a data flow. The data flow is pretty straightforward, nothing complex going on here, only a couple of lookups and derived columns.

It was working fine until earlier today, but since this afternoon the preview of the output data at each step is timing out. One of the lookup tables has about 7 million records, but it was still showing up fine until earlier today. I changed the row limits in the Debug settings, but the output data preview still times out without returning any results. Even the data preview of the derived columns is timing out.

I thought of running the pipeline anyway to see what gets loaded into the final table. But that failed, because apparently the derived column is getting some NULL values and passing them on to a NOT NULL field in the sink's table structure. So now I really need to preview the step output.

The data preview takes really long and then just errors out with a timeout error. Please suggest if there is any way to improve the performance of ADF here.


Solution

  • Without more details about your setup, it is hard to answer accurately.

    However, as explained here in the MS doc for internal server errors:

    Successful execution of data flows depends on many factors, including the compute size/type, numbers of source/sinks to process, the partition specification, transformations involved, sizes of datasets, the data skewness and so on.

    Further, looking at the troubleshooting tips here:

    There may be a high number of null values or missing values which may be caused by having too few rows sampled. Try updating the debug row limit and refreshing the data.

    For more guidance, see Integration Runtime performance.

    From the doc, the recommendation is:

    Go to Debug Settings, increase the number of rows in the source row limit. Select an Azure IR that has a data flow cluster that's large enough to handle more data.

    Even though the integration runtime has compute resources allocated elastically, you can manually set up a new integration runtime with higher capacity, as sketched below.

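    If you prefer to script this rather than click through the portal, here is a minimal sketch using the azure-mgmt-datafactory Python SDK. All resource names in angle brackets, the IR name `BigDataFlowIR`, and the sizing choices (memory-optimized, 16 cores) are placeholder assumptions, not values from your setup:

    ```python
    # Minimal sketch: create a managed Azure IR with a larger data flow cluster.
    # Names in angle brackets and the sizing choices are placeholders.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import (
        IntegrationRuntimeComputeProperties,
        IntegrationRuntimeDataFlowProperties,
        IntegrationRuntimeResource,
        ManagedIntegrationRuntime,
    )

    client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

    ir = IntegrationRuntimeResource(
        properties=ManagedIntegrationRuntime(
            compute_properties=IntegrationRuntimeComputeProperties(
                location="AutoResolve",
                data_flow_properties=IntegrationRuntimeDataFlowProperties(
                    compute_type="MemoryOptimized",  # "General", "MemoryOptimized", or "ComputeOptimized"
                    core_count=16,                   # larger than the default small cluster
                    time_to_live=10,                 # minutes to keep the cluster warm between debug runs
                ),
            )
        )
    )

    client.integration_runtimes.create_or_update(
        "<resource-group>", "<factory-name>", "BigDataFlowIR", ir
    )
    ```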

    Also, since you mention millions of records: when you have a pipeline with data flows executing in parallel, or data flows that need to be tested with large datasets, choose "Use Activity Runtime" so that the service uses the integration runtime that you've selected in your data flow activity. This allows the data flows to execute on multiple clusters and can accommodate your parallel data flow executions (see the sketch below).
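    The scripted equivalent of selecting that runtime on the data flow activity ("Run on (Azure IR)") could look like the following sketch, again with placeholder names (`MyDataFlow`, `MyPipeline`, and the IR created above):

    ```python
    # Sketch: point an Execute Data Flow activity at the custom IR.
    # "MyDataFlow", "MyPipeline", and "BigDataFlowIR" are placeholder names.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import (
        DataFlowReference,
        ExecuteDataFlowActivity,
        IntegrationRuntimeReference,
        PipelineResource,
    )

    client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

    run_flow = ExecuteDataFlowActivity(
        name="RunMyDataFlow",
        data_flow=DataFlowReference(reference_name="MyDataFlow"),
        integration_runtime=IntegrationRuntimeReference(reference_name="BigDataFlowIR"),
    )

    client.pipelines.create_or_update(
        "<resource-group>", "<factory-name>", "MyPipeline",
        PipelineResource(activities=[run_flow]),
    )
    ```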

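    As a side note on the run failure itself: a NULL reaching a NOT NULL sink column can be guarded in the derived column expression, for example with something like `iifNull(yourColumn, <someDefault>)` (where `yourColumn` and the default are placeholders for your actual column and fallback value), so the sink never receives a null.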