marklogic, marklogic-9, mlcp, marklogic-dhf

MarkLogic - Ingest and Harmonize performance tuning


I have a 500 MB aggregate XML file that is taking 30 sec for mlcp ingest (approx 80,000 documents) and around 6 minutes for harmonization (converts each XML document to JSON before loading to FINAL DB).

The harmonization job follows the regular data-hub pattern (collector, content, writer etc.)

I have 50 such files to process and am looking for ways to optimize the run time.

1) Is there a way I can kick off mlcp load and harmonize in parallel for multiple files (in the same job)?

2) In the harmonize job, I tried using the -PbatchSize and -PthreadCount parameters, but they have no impact beyond a batch size of 500 and a thread count of 6. How could I improve performance by increasing these two values? Are any server-level settings required? Are there any other parameters that could help improve performance?

3) Any other alternatives to improve the performance of harmonize step?

Thanks in Advance!


Solution

  • Regarding 1)

    You can point MLCP for the input flow to a directory rather than just one file, and it should process all files in the subtree in one run. Once the input flow has finished, you can start the harmonize, and the collector of the harmonize should pick up all files that are available.

    However, if you would like to parallelize the load, you should probably not load everything in one run. Tweak your MLCP ingest to add an extra collection indicating an import number, or simply the filename of the aggregate file. Tweak your collector to take an (optional) extra argument that narrows the selection to that import number or aggregate filename. You then run the import of one aggregate and launch the harmonize for it once the import finishes. Without waiting for that harmonize to complete, you do the same for a second aggregate. Ditto for the remaining ones, one by one.
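
    A minimal sketch of that per-file pipeline, in Python. Each worker imports one aggregate and then harmonizes only that import, and a small thread pool lets several files overlap. Everything specific here is an assumption for illustration: the host, port, credentials, record element, entity and flow names, and in particular the `importCollection` option name, which your collector would have to read itself. As far as I know DHF forwards `-Pdhf.*` properties to the flow as options, but check that against your DHF version. The input-flow transform options that a real DHF ingest needs are left out for brevity.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def mlcp_import_cmd(xml_file, collection):
    """Build an mlcp import command that tags every document with `collection`."""
    return [
        "mlcp.sh", "import",
        "-host", "localhost", "-port", "8010",       # assumed STAGING app server
        "-username", "admin", "-password", "admin",  # placeholder credentials
        "-input_file_path", str(xml_file),
        "-input_file_type", "aggregates",
        "-aggregate_record_element", "record",       # assumed record element name
        "-output_collections", collection,
    ]


def harmonize_cmd(collection, batch_size=100, thread_count=4):
    """Build a hubRunFlow command restricted to one import collection."""
    return [
        "gradle", "hubRunFlow",
        "-PentityName=MyEntity",                 # hypothetical entity name
        "-PflowName=MyHarmonizeFlow",            # hypothetical flow name
        f"-PbatchSize={batch_size}",
        f"-PthreadCount={thread_count}",
        f"-Pdhf.importCollection={collection}",  # hypothetical option read by the collector
    ]


def process_file(xml_file, run=subprocess.run):
    """Import one aggregate file, then harmonize only the documents it produced."""
    collection = f"import-{Path(xml_file).stem}"
    run(mlcp_import_cmd(xml_file, collection), check=True)
    run(harmonize_cmd(collection), check=True)
    return collection


def process_all(xml_files, max_parallel=2, run=subprocess.run):
    """Run the import-then-harmonize pipeline for several files concurrently."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(lambda f: process_file(f, run), xml_files))
```

    Calling `process_all(sorted(Path("input").glob("*.xml")))` would then work through a directory of aggregates. Keep `max_parallel` low to begin with, since every extra pipeline competes for the same server resources.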

    Regarding 2)

    Increasing the numbers isn't guaranteed to increase speed. If the harmonize is relatively heavy, you might be better off with smaller batch sizes and lower thread counts. Look at memory and CPU load. Increase the values only while both stay below 90%. Increasing further won't help once you hit the ceiling. Scaling out horizontally (adding extra nodes to your cluster) would be the only solution in that case.

    Also take I/O speed into consideration. MarkLogic can only write to disk as fast as storage allows. More forests, and more nodes in the cluster holding forests, help there.
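
    One way to find where the numbers stop helping is to time the same harmonize run over a small grid of values. The sketch below is only a timing harness: `run_harmonize` is a hypothetical callable you would supply (for example one that shells out to `gradle hubRunFlow` with the given `-PbatchSize` and `-PthreadCount` against a representative subset), and the injectable `clock` exists only to make the harness easy to test.

```python
import itertools
import time


def sweep(run_harmonize, batch_sizes, thread_counts, clock=time.perf_counter):
    """Time run_harmonize for every (batch_size, thread_count) pair.

    Returns a list of (elapsed_seconds, batch_size, thread_count) tuples,
    sorted with the fastest combination first.
    """
    results = []
    for batch_size, thread_count in itertools.product(batch_sizes, thread_counts):
        start = clock()
        run_harmonize(batch_size, thread_count)
        results.append((clock() - start, batch_size, thread_count))
    return sorted(results)
```

    Watch CPU, memory and disk on the MarkLogic hosts while this runs. If the timings flatten out while the hosts are already saturated, larger values will not buy you anything.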

    Regarding 3)

    Consider profiling your harmonize code. The import sounds fairly quick. 80k docs in 30 sec is very decent, but the harmonize is much slower. Maybe there are some inefficient steps in there.
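
    To put rough numbers on that, here is the arithmetic using only the figures from the question (approximately 80,000 documents per file, 30 seconds to ingest, 6 minutes to harmonize, 50 files). It assumes every file behaves like the one you measured and that the files run strictly one after another.

```python
DOCS_PER_FILE = 80_000        # approximate, from the question
INGEST_SECONDS = 30
HARMONIZE_SECONDS = 6 * 60
FILE_COUNT = 50

ingest_rate = DOCS_PER_FILE / INGEST_SECONDS        # docs per second
harmonize_rate = DOCS_PER_FILE / HARMONIZE_SECONDS  # docs per second
harmonize_share = HARMONIZE_SECONDS / (INGEST_SECONDS + HARMONIZE_SECONDS)
sequential_hours = FILE_COUNT * (INGEST_SECONDS + HARMONIZE_SECONDS) / 3600

print(f"ingest:    {ingest_rate:,.0f} docs/s")
print(f"harmonize: {harmonize_rate:,.0f} docs/s")
print(f"harmonize share of per-file time: {harmonize_share:.0%}")
print(f"all {FILE_COUNT} files, one after another: {sequential_hours:.1f} h")
```

    That works out to roughly 2,667 docs/s for ingest against 222 docs/s for harmonize, so the harmonize accounts for about 92% of the time per file and is where any tuning effort will pay off.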

    Playing around with the suggestions above might give you a feel for whether there is room for improvement, but often the biggest gain can be found in the code itself.

    HTH!