Search code examples
azureazure-data-factory

Azure Data Factory: Exists transformation doesn't work as expected when comparing two tables


I am trying to compare the contents of two SQL tables in Azure Data Factory (ADF) using the Exists transformation in the data flow to find the delta data. My goal is to move the delta data to the sink. Here are the details:

  • Table 1 has 683,934 rows.
  • Table 2 has 684,000 rows.

Approach:

In my ADF data flow, for each table, I use the sha2() function to generate a derived hash column like this: derivedColumn = SHA2(256, columns())

The underlying business case is that the rows could be the same, but the data within the rows may change over time. Hence, I am comparing the entire content by calculating a hash column on all the columns instead of just comparing unique columns.

I then use the Exists transformation to find rows in Table 1 that do not exist in Table 2.

The delta data is supposed to be upserted to the sink.

Problem:

When I preview the data in the Exists transformation, I see a lot of rows as a result. In my case, it should return zero rows since Table 1 (left side table) has fewer rows than Table 2, and I know for a fact that all these rows and data are already present in Table 2.

Question:

  • Why is the Exists transformation returning so many rows? Am i missing something with regards to broadcasting or partitioning etc? I tried setting Broadcast to Auto as well as Fixed with Right side Stream, but had the same result in both cases.
  • Is there a more effective way to compare the contents of two tables to find delta data using ADF?
  • How can I ensure that the comparison correctly identifies only the changed or new rows?

Additional Information:

  • Both tables have the same schema.

Any insights or suggestions would be highly appreciated. Thank you!

Screenshots of the dataflow:

enter image description here

enter image description here


Solution

  • This issue in dataflow data preview is not actual error. The reason for data mismatch in data preview is because, in debug settings, source1 and source2 are limited to random 1000 rows for data preview. This is the reason for not getting the expected result. When you debug the pipeline with this dataflow, you will not get this error.

    • In order to get correct output in data preview, you can upload the sample data for source1 and source2 in debug settings.

    Refer MS doc regarding this debug settings.