apache-nifi

NiFi: MergeRecord doesn't wait and group JSON files into one batch


I ran into a problem with Apache NiFi. I have roughly 100,000+ JSON files that look like this:

[ {
  "client_customer_id" : 8385419410,
  "campaign_id" : "11597209433",
  "resourceName" : "customers/8385419410/adGroupAds/118322191652~479093457035",
  "campaign" : "11597209433",
  "clicks" : "0",
  "topImpressionPercentage" : 1,
  "videoViews" : "0",
  "conversionsValue" : 0,
  "conversions" : 0,
  "costMicros" : "0",
  "ctr" : 0,
  "currentModelAttributedConversions" : 0,
  "currentModelAttributedConversionsValue" : 0,
  "engagements" : "0",
  "absoluteTopImpressionPercentage" : 1,
  "activeViewImpressions" : "0",
  "activeViewMeasurability" : 0,
  "activeViewMeasurableCostMicros" : "0",
  "activeViewMeasurableImpressions" : "0",
  "allConversionsValue" : 0,
  "allConversions" : 0,
  "averageCpm" : 0,
  "gmailForwards" : "0",
  "gmailSaves" : "0",
  "gmailSecondaryClicks" : "0",
  "impressions" : "2",
  "interactionRate" : 0,
  "interactions" : "0",
  "status" : "ENABLED",
  "ad.resourceName" : "customers/8385419410/ads/479093457035",
  "ad.id" : "479093457035",
  "adGroup" : "customers/8385419410/adGroups/118322191652",
  "device" : "DESKTOP",
  "date" : "2020-11-25"
} ]

Instead of saving records to the database one by one, I want to batch 1,000–10,000 elements into a single JSON file and then save that to the DB to increase throughput. My MergeRecord settings: [screenshot of MergeRecord settings]

What I expected: MergeRecord waits for some time to group the incoming JSON into a batch of 1,000–10,000 elements in one file, and then sends that batch to the PutDatabaseRecord processor.

Actual behaviour: MergeRecord instantly sends the JSON files to PutDatabaseRecord one by one, without grouping or joining them. About 1 in 10 flow files does contain several JSON files merged into one, as you can see from their sizes in the screenshot, but the processor settings don't seem to apply to all files: [screenshot of the queue]

I don't understand where the problem is: the MergeRecord settings or the JSON files themselves? At this rate it is really slow, and loading my data (1.5 GB) will probably take a whole day.
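For context on why batching matters here: inserting records one at a time pays per-statement overhead on every row, while a batched insert amortizes it across the whole batch. A minimal sketch of the difference, using Python's stdlib `sqlite3` and a made-up `metrics` table with a few of the fields from the JSON above (the table and column names are illustrative, not the asker's actual schema):

```python
import sqlite3

# Illustrative subset of the JSON fields above; 1,000 copies stand in
# for a merged batch of 1,000 records.
records = [
    {"client_customer_id": 8385419410, "clicks": "0", "date": "2020-11-25"}
    for _ in range(1000)
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metrics (client_customer_id INTEGER, clicks TEXT, date TEXT)"
)

# Batched insert: executemany() sends the whole batch in one call,
# instead of 1,000 separate INSERT statements.
conn.executemany(
    "INSERT INTO metrics VALUES (:client_customer_id, :clicks, :date)",
    records,
)
conn.commit()

print(conn.execute("SELECT COUNT(*) FROM metrics").fetchone()[0])  # → 1000
```

This is exactly the saving PutDatabaseRecord gets when MergeRecord hands it one flow file with thousands of records instead of thousands of single-record flow files.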


Solution

  • The only way I could replicate this was to use a random table.name for each of the flow files, which caused each file to land in its own bin, rapidly exceeding your "Maximum Number of Bins" and forcing each file to be sent out as a separate flow file. If you have more than 10 tables, I would increase that setting.

    My only other suggestion would be to play around with the Run Schedule and Run Duration of the MergeRecord Processor (on the scheduling tab). If you set the run schedule to 2 minutes (for example), the processor will run once every two minutes and try to merge as many of the files in the queue as it can.
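Since the original screenshots are missing, here is a hedged sketch of MergeRecord properties consistent with the question's stated goal of 1,000–10,000 records per batch. The property names are MergeRecord's real ones; the values are assumptions for illustration, not the asker's actual settings:

```
MergeRecord properties (illustrative values):
  Record Reader              JsonTreeReader
  Record Writer              JsonRecordSetWriter
  Merge Strategy             Bin-Packing Algorithm
  Minimum Number of Records  1000
  Maximum Number of Records  10000
  Maximum Number of Bins     10       <- raise this if records fan out across many tables
  Max Bin Age                2 min    <- flushes a bin even if the minimum isn't reached
```

Setting a Max Bin Age is the usual way to make MergeRecord "wait and then flush": without it, a bin that never reaches the minimum record count has no trigger to be merged and sent downstream.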