Tags: azure, azure-data-factory, azure-cosmosdb

Azure Data Factory: log the skipped rows as well


I have a Teradata database from which we gather data and import it into an Azure Storage account blob as a .gzip file (via the "Copy data" activity). A data flow then exports the data to our Cosmos DB. The problem is that many of the rows contain invalid characters, or the data is missing altogether.

This is why I added a Filter step: [screenshot: Filter transformation in the data flow]

It checks that only rows containing the specific allowed symbols are exported to our Cosmos DB. However, this approach silently discards the invalid rows, which could cause issues in the future. What if the Filter doesn't actually work as we think it does, and we filter out some of the valid data as well? This is why I'm looking for a way to log the skipped rows (the "invalid data").
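
For context, my Filter condition is an ADF data flow expression along these lines (the stream name ImportedRows, the columns id and name, and the allow-list regex are placeholders for our actual condition):

```
ImportedRows filter(
    !isNull(id) && regexMatch(name, '^[A-Za-z0-9 .,-]+$')
) ~> FilterValidRows
```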

Question: Is there a way to somehow log/store the skipped rows as well?

I was thinking of adding an If Condition activity before the import "Copy data" activity to separate valid and invalid data into two different .gzip files, but I couldn't find a way to do it. As I understand it, the Lookup activity supports only up to 5,000 rows, whereas we have millions... The "Copy data" activity's out-of-the-box logging option also looks quite limited.


Solution

  • Instead of the Filter transformation, you can use the Conditional Split transformation in your mapping data flow to route valid and invalid rows to separate streams based on a condition (see the script sketch below).

    [screenshot: Conditional Split transformation in the mapping data flow]

    After the split, you can sink the valid stream to Cosmos DB and the invalid stream to Blob storage, as your requirements dictate.

    Reference: https://learn.microsoft.com/en-us/azure/data-factory/data-flow-conditional-split
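
    As a minimal sketch in data flow script, assuming a source stream named TeradataRows and a hypothetical validity condition over placeholder columns id and name, the split could look like this:

    ```
    TeradataRows split(
        !isNull(id) && regexMatch(name, '^[A-Za-z0-9 .,-]+$'),
        disjoint: false
    ) ~> SplitOnValidity@(ValidRows, InvalidRows)
    ```

    With disjoint: false, each row is routed to the first stream whose condition it matches, and everything else falls through to the last stream, so ValidRows can be sunk to Cosmos DB while InvalidRows goes to a Blob storage sink for later inspection.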