Tags: elasticsearch, elasticsearch-plugin

Elasticsearch 2.2.0 High CPU & High IOPS on StressTest


I recently upgraded my ES cluster from 1.5.2 to 2.2.0 and added Shield to it. I'm trying to perform a stress test using Locust, which blasts the cluster with data (through a Node.js app). I got strange results compared to the previous stress test (on 1.5.2):

        1.5.2                 2.2.0

CPU     50% avg, 90% peak     87% avg, 96% peak

IOPS    30 avg, 300 peak      800 avg, 1,122 peak

Why is ES working so hard?
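One way to see where that extra work goes (segment merging, as the accepted answer below explains) is to poll the merge statistics exposed by the nodes stats API while the test runs. A minimal sketch in Python, assuming the cluster listens on localhost:9200 and using hypothetical Shield credentials:

```python
# Minimal sketch: watch segment-merge activity while the stress test runs.
# Assumes ES at localhost:9200; the Shield credentials are hypothetical.
import time
import requests

ES = "http://localhost:9200"
AUTH = ("esadmin", "secret")  # Shield accepts HTTP basic auth

def merge_stats():
    # The nodes stats API exposes a "merges" section per node under "indices".
    resp = requests.get(ES + "/_nodes/stats/indices", auth=AUTH)
    resp.raise_for_status()
    return {node_id: node["indices"]["merges"]
            for node_id, node in resp.json()["nodes"].items()}

if __name__ == "__main__":
    while True:
        for node_id, merges in merge_stats().items():
            print(node_id, merges)  # totals/times climbing fast => merging is the load
        time.sleep(5)
```

If merging accounts for the load, the merge totals and merge times should climb steeply during the test on 2.2.0.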

Another strange thing that I can't understand, and which I think is connected to the above, is the output of the head plugin. Previously (1.5.2) it showed the index store data as:

Index_name

size: 10.3Gi (20.6Gi)

docs: 17,073,010 (17,073,010)

But now (2.2.0) it is as:

Index_name

size: 13.7Gi (29.3Gi)

docs: 10,217,220 (20,434,440)

As you can see, the data doubled itself in ES 2.2.0. Why is that happening? Is there something wrong with my 2.2.0 ES configuration?
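Whether the extra documents and disk space come from replica copies (one of the points in the accepted answer below) can be checked with the index stats API, which reports primary-only and replica-inclusive totals side by side. A minimal sketch, with a hypothetical index name and Shield credentials:

```python
# Minimal sketch: compare primary-only vs. replica-inclusive totals for one index.
# The index name and Shield credentials are hypothetical.
import requests

ES = "http://localhost:9200"
AUTH = ("esadmin", "secret")
INDEX = "index_name"

stats = requests.get("%s/%s/_stats" % (ES, INDEX), auth=AUTH).json()["_all"]
for scope in ("primaries", "total"):          # "total" includes replica shards
    docs = stats[scope]["docs"]["count"]
    size_gib = stats[scope]["store"]["size_in_bytes"] / 1024.0 ** 3
    print("%-10s docs=%-12d store=%.1f GiB" % (scope, docs, size_gib))
```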


Solution

  • I got my answer in the Elasticsearch community forum.

    Zachary Tong's answer:

    Agreeing with the points @rusty raised: doc values being on by default adds some CPU/IO overhead and some more disk space, the translog now syncs on every request (instead of every 5s), and there is the replica issue.

    In addition to that, there was a change at the Lucene layer. Incoming blob of text, but the tl;dr is that Lucene identifies idle resources and utilizes them, making the resource usage look higher when it's really just getting work done faster.

    So, in Elasticsearch 1.x, we forcefully throttled Lucene's segment merging process to prevent it from over-saturating your nodes/cluster.

    The problem is that a strict threshold is almost never the right answer. If you are indexing heavily, you often want to increase the threshold to let Lucene use all your CPU and Disk IO. If you aren't indexing much, you likely want the threshold lower. But you also want it to be able to "burst" the limit for one-off merges when your cluster is relatively idle.

    In Lucene 5.x (used in ES 2.0+), they added a new style of merge throttling that monitors how active the index is and automatically adjusts the throttle threshold (see https://issues.apache.org/jira/browse/LUCENE-6119, https://github.com/elastic/elasticsearch/pull/9243 and https://github.com/elastic/elasticsearch/pull/9145).

    In practice, what this means is that your indexing tends to be faster in ES 2.0+ because segments are allowed to merge as fast as your cluster can handle, without over-saturating your cluster. But it also means that your cluster will happily use any idle resources, which is why you see more resource utilization.

    Basically, Lucene identified that those resources weren't being used...so it put them to work to finish the task faster.
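For completeness, two of the 2.x behaviours mentioned in the answer can be dialled back where the trade-off is acceptable. The sketch below creates an index with the translog synced on an interval rather than per request, and with doc values disabled on a field that is never sorted or aggregated on; the index, type and field names are hypothetical, and both changes trade durability or aggregation support for throughput:

```python
# Minimal sketch (ES 2.x): create an index that approximates two of the 1.x defaults
# mentioned above -- interval-based translog sync and no doc values on chosen fields.
# Index, type and field names are hypothetical; async translog accepts a small window
# of data loss on crash, and doc values must be disabled at mapping time.
import json
import requests

ES = "http://localhost:9200"
AUTH = ("esadmin", "secret")

body = {
    "settings": {
        "index.translog.durability": "async",  # sync on an interval, not per request
        "index.translog.sync_interval": "5s",  # roughly the old 1.x cadence
    },
    "mappings": {
        "event": {
            "properties": {
                # Never sorted or aggregated on, so doc values are pure overhead here.
                "payload": {"type": "string", "index": "not_analyzed", "doc_values": False}
            }
        }
    },
}

resp = requests.put(ES + "/stress_test_index", auth=AUTH, data=json.dumps(body))
print(resp.status_code, resp.json())
```

Merge throttling itself no longer needs tuning in 2.x, since Lucene's auto-throttle adjusts it based on index activity, as described above.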