Hi, I am new to NiFi and have followed the tutorial here to understand the contents of the provenance repository and how to move them out for auditing. I have a couple of questions.
The main use of provenance data is to understand exactly what happened to a piece of data. But here the data is in a flow file. How are we supposed to understand what happened to a particular piece of data using the flow file?
Is it best practice to always send provenance data from one NiFi instance to another? Why not use the SiteToSiteProvenanceReportingTask to send to a port in the same NiFi instance and extract it from there?
What are the best tools for sending this data out for auditing?
Hopefully this answers your questions:
You can export the provenance data in many ways. To extract the content of the flowfile from a provenance event, I believe you have to get at the "content claims" for the flowfile, though I'm not sure how that works. Because content claims are reclaimed when no flowfile in the current system is using them, I don't think you can query the content behind a provenance event once that content no longer exists in the content repository. Some components will add an attribute for any errors/status they encounter.
You can certainly use a SiteToSiteProvenanceReportingTask to send provenance data from a cluster back to itself. You will probably just want to filter out the events from the Input Port and Process Group that handle the provenance data, so that the reporting flow does not end up reporting on itself.
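If the filtering ends up happening downstream of NiFi, it could look roughly like the sketch below. This is only an illustration: it assumes each event arrives as a JSON object carrying `componentId` and `processGroupId` fields, and the IDs and the `drop_self_reporting` helper are made-up placeholders, not anything NiFi provides.

```python
import json

# Hypothetical placeholders: substitute the UUIDs of the Input Port and
# Process Group that receive and process the provenance events.
EXCLUDED_COMPONENT_IDS = {"provenance-input-port-id"}
EXCLUDED_GROUP_IDS = {"provenance-process-group-id"}


def drop_self_reporting(events,
                        component_ids=EXCLUDED_COMPONENT_IDS,
                        group_ids=EXCLUDED_GROUP_IDS):
    """Return only the events that were not emitted by the provenance-handling flow.

    Assumes each event is a dict with 'componentId' and 'processGroupId' keys;
    events missing those keys are kept.
    """
    return [
        event for event in events
        if event.get("componentId") not in component_ids
        and event.get("processGroupId") not in group_ids
    ]


def load_events(json_text):
    """Parse a batch of events, assumed here to be serialized as a JSON array."""
    return json.loads(json_text)
```

For example, `drop_self_reporting(load_events(batch_text))` would return the batch minus anything produced by the excluded port or group.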
Data provenance is sometimes a graph problem, but the events are often useful on their own (without needing to know the flow, for example), so analysis can be done on the events themselves. I've sent the events to a Hive table and was then able to do some things with HiveQL, like calculating predicted backpressure on connections (before we added that to NiFi proper).
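As a rough idea of what analysis on the events alone can look like, here is a minimal sketch that summarizes a batch of events per component. It is not the HiveQL I used, just a small stand-in. It assumes each event is a dict with `componentName`, `entitySize` and `durationMillis` fields, and it treats a missing or negative duration as unknown; the `summarize_by_component` helper is hypothetical.

```python
from collections import defaultdict


def summarize_by_component(events):
    """Aggregate event count, bytes and processing time per component.

    Assumes each event is a dict that may carry 'componentName',
    'entitySize' (bytes) and 'durationMillis'. Missing values count as 0,
    and a negative duration is treated as unknown (also 0).
    """
    summary = defaultdict(lambda: {"events": 0, "bytes": 0, "duration_ms": 0})
    for event in events:
        stats = summary[event.get("componentName") or "unknown"]
        stats["events"] += 1
        stats["bytes"] += event.get("entitySize") or 0
        stats["duration_ms"] += max(event.get("durationMillis") or 0, 0)
    return dict(summary)
```

Nothing here needs the flow definition: the events carry enough on their own to see which components handle the most data or spend the most time on it.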