Tags: performance, elasticsearch, query-optimization, elastic-stack, database-performance

Elasticsearch: experiencing very slow search speed


We have a cluster consisting of 3 master nodes (4 cores, 16 GB RAM each), 3 hot nodes (8 cores, 32 GB RAM, 300 GB SSD each), and 3 warm nodes (8 cores, 32 GB RAM, 1.5 TB HDD each).

We have one index for each month of the year, following the naming convention voucher_YYYY_MMM (e.g. voucher_2021_JAN). All these indexes share the alias voucher, which acts as a read alias, and our search queries are directed at this read alias.
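
For reference, such a read alias is typically maintained with the _aliases API; a minimal sketch using index names that follow the convention above:

POST _aliases
{
  "actions": [
    { "add": { "index": "voucher_2021_JAN", "alias": "voucher" } },
    { "add": { "index": "voucher_2021_FEB", "alias": "voucher" } }
  ]
}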

Our index resides on the hot nodes for 32 days, the period during which it receives 99% of its writes. We estimate approximately 480 million docs in this index; it has 1 replica and 16 shards. (We chose 16 shards because our data will eventually grow; right now we are considering shrinking to 8 shards of about 30 GB each, since per our mapping 2 million docs take about 1 GB of space.)
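
For context, the shrink we are considering would look roughly like the following (the target index and node names are placeholders; the source index must first be made read-only and have a copy of every shard on a single node):

# Step 1: block writes and collocate a copy of every shard on one node
PUT voucher_2021_JAN/_settings
{
  "index.blocks.write": true,
  "index.routing.allocation.require._name": "hot-node-1"
}

# Step 2: shrink 16 shards down to 8 (the target must be a factor of 16)
POST voucher_2021_JAN/_shrink/voucher_2021_JAN_shrunk
{
  "settings": {
    "index.number_of_shards": 8,
    "index.routing.allocation.require._name": null,
    "index.blocks.write": null
  }
}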

After 32 days the index moves to the warm nodes. Currently we have 450 million docs in our hot index and 1.8 billion docs collectively in our warm indexes, for a total of 2.25 billion docs.

Each doc contains a customer id and some fields we filter on, all mapped as keyword types. We use custom routing on the customer id to improve search speed.
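
A minimal sketch of such a mapping, assuming the field names used in the query below (making _routing required is optional, but it guards against accidentally unrouted writes):

PUT voucher_2021_JAN
{
  "mappings": {
    "_routing": { "required": true },
    "properties": {
      "uId":    { "type": "keyword" },
      "isGift": { "type": "keyword" },
      "cdInf": {
        "properties": {
          "crtdAt": { "type": "date" }
        }
      }
    }
  }
}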

Our typical query looks like this:

GET voucher/_search?routing=1000636779&search_type=query_then_fetch
{
  "from": 0,
  "size": 20,
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "filter": [
            {
              "term": {
                "uId": {
                  "value": "1000636779",
                  "boost": 1
                }
              }
            },
            {
              "terms": {
                "isGift": [
                  "false"
                ]
              }
            }
          ]
        }
      }
    }
  },
  "version": true,
  "sort": [
    {
      "cdInf.crtdAt": {
        "order": "desc"
      }
    }
  ]
}

We use a constant_score query because we don't need to score our documents and want to maximize search speed.

We have 13 search threads on each of our hot and warm nodes, and we send indexing and search requests to our master nodes.

We send 100 search requests per second and see an average search response time of about 3.5 seconds, with a maximum of up to 9 seconds.

I don't understand what we are missing; why is our search performance so poor?


Solution

  • Thank you for the exhaustive explanations. Based on them, here are a few points of improvement (in no particular order):

    1. Never direct your search and index requests to the master nodes; they should never handle traffic. Send them to the data nodes directly or, better yet, to dedicated coordinating nodes.
    2. As a direct consequence, the master(-eligible) nodes don't need 16 GB of RAM; 2 GB is more than sufficient, because they will no longer act as coordinating nodes.
    3. If your queries contain time ranges, you could leverage index sorting on the cdInf.crtdAt field: faster searches at the cost of slower ingestion. It only makes sense if your queries have a time constraint, otherwise not (see the index-sorting sketch after this list).
    4. 16 shards per index on 3 hot nodes is not a good sharding strategy. You should have a multiple of the number of nodes (3, 6, 9, etc.); otherwise one node will carry more shards than the others, and you might create hot spots. You could also add one more hot node, so each one holds 4 shards. This is a typical example of oversharding. Since your indexes are rolled over each month, it's easy to adjust the number of primary shards in the index template as you see your data growing (see the template sketch after this list).
    5. Leveraging routing in order to search fewer shards is a good idea. It's not clear from the question how many indexes in total sit behind the voucher alias, but that would be good information to have in order to assess whether the sharding and the search thread pool size are appropriate. Based on the doc counts you provide, it seems you have 1 hot index and 5 warm ones, so 6 indexes in total; each routed search request will therefore hit only 6 shards (one per index).
    6. 100 search requests per second with 13 search threads per node (the default for 8 cores) means that every second each node has to handle 7+ search requests. Since requests take approximately 3 seconds to return, you're building up a search queue because the nodes might not be able to keep up (the thread pool sketch after this list shows how to verify this).
    7. Another way to benefit from filter caching is the preference query string parameter (see the sketch after this list).
    8. Part of the slowness also comes from the fact that roughly 80% of the data you're searching is located on warm nodes with spinning disks. Depending on your use case, you might want to split your search in two: one super fast search on the hot data and a second, slower search on the warm data (see the two-request sketch after this list).
    9. Once your indexes get relocated to the warm nodes (and if they no longer receive updates), it's a good idea to force merge them down to a few segments (3 to 5), so that your searches have fewer segments to traverse and the indexes shrink in size (deleted documents get expunged); see the force-merge sketch after this list.
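
A sketch of point 3: index sorting can only be defined at index creation time, so it would go into the settings of each new monthly index (assuming cdInf.crtdAt is mapped as a date field):

PUT voucher_2021_FEB
{
  "settings": {
    "index.sort.field": "cdInf.crtdAt",
    "index.sort.order": "desc"
  },
  "mappings": {
    "properties": {
      "cdInf": {
        "properties": {
          "crtdAt": { "type": "date" }
        }
      }
    }
  }
}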
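
For point 4, the shard count lives in the index template, so changing it only affects the next monthly index. A sketch using the composable template API (the template name is arbitrary; on clusters older than 7.8, the legacy _template API is the equivalent):

PUT _index_template/voucher_template
{
  "index_patterns": ["voucher_*"],
  "template": {
    "settings": {
      "index.number_of_shards": 6,
      "index.number_of_replicas": 1
    }
  }
}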
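
To verify the queue buildup described in point 6, the cat thread pool API shows active, queued, and rejected search tasks per node:

GET _cat/thread_pool/search?v&h=node_name,active,queue,rejected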
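
A sketch of point 7: sending all of a given customer's searches with the same preference string pins them to the same shard copies, so cached filter results get reused across requests (here the customer id doubles as the preference value):

GET voucher/_search?routing=1000636779&preference=1000636779
{
  "size": 20,
  "query": {
    "constant_score": {
      "filter": {
        "term": { "uId": "1000636779" }
      }
    }
  }
}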
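
Point 8 could be implemented by querying concrete index names instead of the voucher alias; the names below are illustrative, following the voucher_YYYY_MMM convention:

# Fast search against the current hot index only
GET voucher_2021_JAN/_search?routing=1000636779
{
  "query": { "term": { "uId": "1000636779" } }
}

# Follow-up search against the older, warm indexes
GET voucher_2020_*/_search?routing=1000636779
{
  "query": { "term": { "uId": "1000636779" } }
}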
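
Finally, a sketch of point 9: once a monthly index has moved to the warm tier and stopped receiving writes, it can be merged down to a few segments (the index name is illustrative):

POST voucher_2020_DEC/_forcemerge?max_num_segments=5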