NiFi: MergeContent to create a ZIP archive of CSV files not working


I have a flow that fetches all files from a given directory, and they could be gzip, zip or csv, with the gzip and zip holding a single csv file. I then route on MIME type, decompress the gzip files, unpack the zip files, and then bring what are now ALL csv files back together. This is working. I then want to create a zip archive of all of these csv files, and MergeContent seemed like a good candidate (outside of using ExecuteStreamCommand to run zip from the OS).

But no matter what I do, the results are inconsistent:

  • The first time I ran it with default properties in MergeContent (and setting Merge Format to ZIP), it created a single zip file — I thought I was done!
  • But the second time, it created six zip files, so I realized I wasn't done, and started playing with properties
  • Changed Maximum Number of Bins to 1 - created five zip files
  • Changed Correlation Attribute Name to absolute.path, which is the same for all flow files - created four zip files
  • Changed Maximum Group Size to 500 MB (all csv files zipped up are ~500 KB) - created two zip files
  • Changed Minimum Number of Entries to 1000, Maximum to 5000 (62 csv files) - created forty zip files
  • Changed load balancing on the queue to single node - created twelve zip files

Clearly I'm throwing darts hoping something will stick. Documentation (and the general wisdom of the web) hasn't been particularly helpful in understanding how this works, what a "bin" is, or what a "bundle" is. Most of it seems geared towards breaking apart a single flow file, doing some processing, then bringing it back together. That's not what I'm doing. I'm starting with multiple flow files and want to merge all of them, every time, into a single flow file.

If the answer is that this can't be done with MergeContent, then I'll just run zip through the OS, but that would mean writing the files to disk, then zipping, and I wanted to try to keep this native NiFi.

Again, I started with default properties, except changing Merge Format to ZIP, and then made my modifications from there. And, yes, I am using the "merged" relationship.


Solution

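  • As it turns out, the one property I did not play with, Max Bin Age, was the key. Matt's excellent explanation here gave tremendous insight, and provided the ultimate solution. As I understand it, with the default Minimum Number of Entries of 1, a bin is eligible to be merged as soon as it holds a single flow file, so each run of the processor zips whatever happens to be queued at that moment. Setting Minimum Number of Entries well above the number of files I expect, together with a Max Bin Age, makes the bin keep collecting until the age expires and then merge everything it holds.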

    Current config:

    Merge Strategy: Bin-Packing Algorithm
    Merge Format: ZIP
    Attribute Strategy: Keep Only Common Attributes
    Correlation Attribute Name: No value set
    Minimum Number of Entries: 500
    Maximum Number of Entries: 1000
    Minimum Group Size: 0 B
    Maximum Group Size: No value set
    Max Bin Age: 15 secs
    Maximum Number of Bins: 5
    Compression Level: 1
    Keep Path: false
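
To make the effect of Max Bin Age a bit more concrete, here is a toy simulation in Python. It is only a sketch of the binning idea as I understand it, not NiFi's actual implementation: it assumes a single bin (no Correlation Attribute Name), a processor that runs once per second, and ignores group sizes entirely. The names simulate_merge and make_arrivals, the one-second tick, and the arrival pattern (62 files trickling in over six seconds) are all made up for illustration.

```python
from typing import List, Optional, Tuple


def make_arrivals(n_files: int = 62, per_second: int = 11) -> List[Tuple[int, str]]:
    """Hypothetical arrival pattern: per_second files reach the queue each second."""
    return [(k // per_second, f"file_{k}.csv") for k in range(n_files)]


def simulate_merge(
    arrivals: List[Tuple[int, str]],
    min_entries: int,
    max_entries: int,
    max_bin_age: Optional[int],
) -> List[List[str]]:
    """Toy model of one bin; returns the list of bundles (one per output zip)."""
    pending = sorted(arrivals)
    bundles: List[List[str]] = []
    entries: Optional[List[str]] = None  # the open bin, if any
    created_at = 0                       # second at which the open bin was created
    i = 0
    now = 0
    while i < len(pending) or entries is not None:
        # Move everything that has arrived by `now` into the bin.
        while i < len(pending) and pending[i][0] <= now:
            if entries is None:
                entries, created_at = [], now
            entries.append(pending[i][1])
            i += 1
            if len(entries) >= max_entries:  # bin is full: merge immediately
                bundles.append(entries)
                entries = None
        if entries is not None:
            full_enough = len(entries) >= min_entries
            expired = max_bin_age is not None and now - created_at >= max_bin_age
            if full_enough or expired:
                bundles.append(entries)
                entries = None
            elif i >= len(pending) and max_bin_age is None:
                # Nothing else will arrive and the bin never expires:
                # in this model it would simply wait forever, so stop.
                break
        now += 1  # next (assumed) processor run, one second later
    return bundles


# Defaults-like settings: minimum of 1 entry, no bin age.
default_like = simulate_merge(make_arrivals(), min_entries=1, max_entries=1000, max_bin_age=None)
# Settings from the config above: minimum 500, maximum 1000, Max Bin Age 15 secs.
tuned = simulate_merge(make_arrivals(), min_entries=500, max_entries=1000, max_bin_age=15)

print(len(default_like), "bundles with a minimum of 1 entry")
print(len(tuned), "bundle(s) with minimum 500 and a 15 second bin age")
```

In this toy setup the first call produces one bundle per processor run, because the bin is already "full enough" every time it is checked, while the second call holds the bin open until it is 15 seconds old and then emits all 62 files together. The real processor has more moving parts (multiple bins, size thresholds, scheduling), so treat this only as an intuition aid. The practical caveat is that Max Bin Age has to be longer than the time it takes for all files to reach the queue.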