Tags: optimization, marklogic, provisioning

Determine the optimal number of forests for a MarkLogic content database


A MarkLogic content database is normally quite big and is configured across a MarkLogic cluster so that it is spread over multiple nodes (hosts).

How many forests per host? Based on the diagrams in the official clustering guide (linked below), a host may sometimes have two or more forests, or just one. Is the number related to the number of CPU cores on the host?

How do you determine the optimal number of forests for a content database?


Solution

  • There is no hard-and-fast rule for forest size and count, as a number of factors around content, usage patterns, and SLAs can influence the answer.

    There are some general guidelines though: https://docs.marklogic.com/guide/cluster/scalability#id_96443

    As your content grows in size, you might need to add forests to your database. There is no limit to the number of forests in a database, but there are some guidelines for individual forest sizes where, if the guidelines are greatly exceeded, then you might see performance degradation.

    The numbers in these guidelines are not exact, and they can vary considerably based on the content. Rather, they are approximate, rule-of-thumb sizes. These numbers are based on average sized fragments of 10k to 100k. If your fragments are much larger on average, or if you have a lot of large binary documents, then the forests can probably be larger before running into any performance degradation.

    The rule-of-thumb maximum size for a forest is 512GB. Each forest should ideally have two vCPUs of processing power available on its host, with 8GB memory per vCPU. For example, a host with eight vCPUs and 64GB memory can manage four 512GB forests. For bare-metal systems, a hardware thread (hyperthread) is equivalent to a vCPU. It is a good idea to run performance tests with your own workload and content. If you have many configured indexes, you may need more memory. Memory requirements may also increase over time as projects evolve and forests grow with more content and more indexes.
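
    As a rough illustration of the arithmetic above, here is a minimal sketch (in Python) of the rule-of-thumb sizing. The constants come straight from the guidelines; the host shape and database size in the example are hypothetical, and a real deployment should be validated with performance testing rather than a formula.

        import math

        # Rule-of-thumb figures from the MarkLogic scalability guidelines above.
        # They are approximations, not hard limits, and assume average fragment
        # sizes of roughly 10k to 100k.
        MAX_FOREST_SIZE_GB = 512   # rule-of-thumb maximum size per forest
        VCPUS_PER_FOREST = 2       # ideally two vCPUs of headroom per forest
        MEM_GB_PER_VCPU = 8        # 8GB memory per vCPU

        def forests_per_host(vcpus, memory_gb):
            """Approximate how many forests a single host can comfortably manage."""
            by_cpu = vcpus // VCPUS_PER_FOREST
            by_memory = memory_gb // (VCPUS_PER_FOREST * MEM_GB_PER_VCPU)
            return min(by_cpu, by_memory)

        def forests_for_content(content_size_gb):
            """Approximate how many forests a database needs for a given content size."""
            return max(1, math.ceil(content_size_gb / MAX_FOREST_SIZE_GB))

        # Example from the guidelines: a host with eight vCPUs and 64GB memory
        # can manage four 512GB forests.
        print(forests_per_host(vcpus=8, memory_gb=64))         # -> 4

        # A hypothetical 3TB content database would call for roughly six
        # forests, i.e. at least two such hosts.
        print(forests_for_content(content_size_gb=3 * 1024))   # -> 6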

    If you have a small content database, it may not make sense to have a forest per host. A single forest may be fine until you get to a certain size, or unless you are looking to achieve certain performance benchmarks. If you expect the system to grow significantly or have high throughput demands, it can be helpful to spread the load amongst multiple D-nodes.

    Other factors that might influence the number of forests include whether you have HA replicas and want to ensure that, in the event of a failover during peak loads, the failover host doesn't double its load when it opens the HA replica forests. Having more forests per host, and striping the HA replicas so that when a host fails its load is spread across two failover hosts that each take on half the load, might provide better resiliency and smoother performance (see the sketch below). But that largely depends on the types of workloads and how hard those D-nodes get pushed.
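
    As a minimal sketch of that striping idea, assume a hypothetical three-host cluster with two primary forests per host, where each host's replica forests are placed on the other two hosts. The host and forest names are made up purely for illustration; in MarkLogic the actual replica assignments would be made through the Admin UI or the Management API.

        # Hypothetical three-host cluster, two primary forests per host.
        primary_forests = {
            "host-1": ["db-f1", "db-f2"],
            "host-2": ["db-f3", "db-f4"],
            "host-3": ["db-f5", "db-f6"],
        }

        # HA replicas striped across the other two hosts, so a failed host's
        # load is split between two survivors instead of doubling on one.
        replica_placement = {
            "db-f1": "host-2", "db-f2": "host-3",   # host-1's replicas
            "db-f3": "host-1", "db-f4": "host-3",   # host-2's replicas
            "db-f5": "host-1", "db-f6": "host-2",   # host-3's replicas
        }

        def failover_load(failed_host):
            """Show which hosts open the failed host's replica forests."""
            load = {}
            for forest in primary_forests[failed_host]:
                load.setdefault(replica_placement[forest], []).append(forest)
            return load

        # If host-1 fails, host-2 and host-3 each pick up one forest.
        print(failover_load("host-1"))  # -> {'host-2': ['db-f1'], 'host-3': ['db-f2']}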

    If you can run performance tests with realistic data and usage scenarios, monitor resource consumption, and measure against your SLAs for response times, you can more easily experiment and determine where the thresholds are and when a change to the number of forests or D-nodes would be helpful.