
Apache Nifi - Data Provenance Options


Does anyone know whether it is possible to disable the storing of content but keep basic metadata (attributes) within provenance?

I don't need to store the content of the files flowing through my system, but I would quite like to keep the metadata for error-tracing purposes. Is there any way of turning off just the FlowFile content storage?

Otherwise, what is the best-practice way of disabling provenance altogether? For example: setting the max storage time to 0 sec, setting the max storage size to 0 bytes, or using the volatile repository with a buffer size of 0. I guess all of these should probably work, but is any one of them specifically recommended?


Solution

  • The Provenance Repository in NiFi stores the history of each FlowFile, not the actual content of the FlowFile. Each time an event occurs for a FlowFile, a new provenance event is created. This provenance event is a snapshot of the FlowFile as it looked, and of where it fit in the flow, at that point in time.

    The actual content of the FlowFiles, on the other hand, is stored in the Content Repository. The Content Repository is simply a place in local storage where the content of all FlowFiles exists.

    Once a FlowFile's content is identified as no longer in use, it is either deleted or archived. If you want the content deleted as soon as it is no longer needed, disable content repository archival by setting nifi.content.repository.archive.enabled to false in nifi.properties.

    However, if you want to keep the content archived for some time so that it can be viewed or replayed from the provenance UI, enable content repository archival and use the properties below to control when the archived content is deleted:

    nifi.content.repository.archive.max.retention.period
    nifi.content.repository.archive.max.usage.percentage
    

    Read more about the repositories and the related properties in the official NiFi documentation.
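
As a rough sketch, the two options described above might look like this in nifi.properties. The property names are the ones mentioned in the answer; the values are only illustrative assumptions, so adjust them to your own retention needs. Changes to nifi.properties take effect after NiFi is restarted.

```properties
# Option 1: no archival. Content is removed once no FlowFile references it.
# Provenance events (with attributes) are still recorded, but the content
# can no longer be viewed or replayed from the provenance UI.
nifi.content.repository.archive.enabled=false

# Option 2 (use instead of option 1): archive, but bound how long and how
# much is kept. The values below are example assumptions, not recommendations.
# nifi.content.repository.archive.enabled=true
# nifi.content.repository.archive.max.retention.period=12 hours
# nifi.content.repository.archive.max.usage.percentage=50%
```

With option 2, archived content is aged off when either limit is reached, so whichever of the two properties is hit first determines when content is deleted.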