Tags: spring, spring-boot, elasticsearch, spring-data, spring-data-elasticsearch

Spring Boot with multiple Elasticsearch clusters starts up very slowly


I have a Spring Boot app with three Elasticsearch clusters (ES v6.4.2) configured. The application.properties file looks like the following (I have three master nodes configured for every cluster, but show only one here for simplicity):

# Cluster 1
spring.data.elasticsearch.cluster-one.cluster-name=<cluster-1-name>
spring.data.elasticsearch.cluster-one.cluster-nodes=<ip-cluster-1-master-node>:9300

# Cluster 2 
spring.data.elasticsearch.cluster-two.cluster-name=<cluster-2-name>
spring.data.elasticsearch.cluster-two.cluster-nodes=<ip-cluster-2-master-node>:9300

# Cluster 3 
spring.data.elasticsearch.cluster-three.cluster-name=<cluster-3-name>
spring.data.elasticsearch.cluster-three.cluster-nodes=<ip-cluster-3-master-node>:9300

spring.data.elasticsearch.repositories.enabled=true

spring.autoconfigure.exclude = org.springframework.boot.autoconfigure.data.elasticsearch.ElasticsearchAutoConfiguration,org.springframework.boot.autoconfigure.data.elasticsearch.ElasticsearchDataAutoConfiguration

For every cluster I have a separate configuration class where I set up the TransportClient and the ElasticsearchTemplate.

When I start the app locally with all three clusters running on my local machine, it starts up normally. But when I deploy the app to my test environment, which uses three separate remote clusters, startup takes more than 20 minutes. It seems to hang while loading the Elasticsearch plugins for the third cluster. Here is an excerpt from the log output:

2019-09-10 00:55:57.607  INFO 27505 --- [           main] o.s.web.context.ContextLoader            : Root WebApplicationContext: initialization completed in 2897 ms
2019-09-10 00:55:57.971  INFO 27505 --- [           main] o.elasticsearch.plugins.PluginsService   : no modules loaded
2019-09-10 00:55:57.972  INFO 27505 --- [           main] o.elasticsearch.plugins.PluginsService   : loaded plugin [org.elasticsearch.index.reindex.ReindexPlugin]
2019-09-10 00:55:57.972  INFO 27505 --- [           main] o.elasticsearch.plugins.PluginsService   : loaded plugin [org.elasticsearch.join.ParentJoinPlugin]
2019-09-10 00:55:57.972  INFO 27505 --- [           main] o.elasticsearch.plugins.PluginsService   : loaded plugin [org.elasticsearch.percolator.PercolatorPlugin]
2019-09-10 00:55:57.972  INFO 27505 --- [           main] o.elasticsearch.plugins.PluginsService   : loaded plugin [org.elasticsearch.script.mustache.MustachePlugin]
2019-09-10 00:55:57.973  INFO 27505 --- [           main] o.elasticsearch.plugins.PluginsService   : loaded plugin [org.elasticsearch.transport.Netty4Plugin]
2019-09-10 00:55:59.785  INFO 27505 --- [           main] o.elasticsearch.plugins.PluginsService   : no modules loaded
2019-09-10 00:55:59.785  INFO 27505 --- [           main] o.elasticsearch.plugins.PluginsService   : loaded plugin [org.elasticsearch.index.reindex.ReindexPlugin]
2019-09-10 00:55:59.785  INFO 27505 --- [           main] o.elasticsearch.plugins.PluginsService   : loaded plugin [org.elasticsearch.join.ParentJoinPlugin]
2019-09-10 00:55:59.785  INFO 27505 --- [           main] o.elasticsearch.plugins.PluginsService   : loaded plugin [org.elasticsearch.percolator.PercolatorPlugin]
2019-09-10 00:55:59.785  INFO 27505 --- [           main] o.elasticsearch.plugins.PluginsService   : loaded plugin [org.elasticsearch.script.mustache.MustachePlugin]
2019-09-10 00:55:59.786  INFO 27505 --- [           main] o.elasticsearch.plugins.PluginsService   : loaded plugin [org.elasticsearch.transport.Netty4Plugin]
2019-09-10 01:18:30.484  INFO 27505 --- [           main] o.elasticsearch.plugins.PluginsService   : no modules loaded
2019-09-10 01:18:30.485  INFO 27505 --- [           main] o.elasticsearch.plugins.PluginsService   : loaded plugin [org.elasticsearch.index.reindex.ReindexPlugin]
2019-09-10 01:18:30.485  INFO 27505 --- [           main] o.elasticsearch.plugins.PluginsService   : loaded plugin [org.elasticsearch.join.ParentJoinPlugin]
2019-09-10 01:18:30.485  INFO 27505 --- [           main] o.elasticsearch.plugins.PluginsService   : loaded plugin [org.elasticsearch.percolator.PercolatorPlugin]
2019-09-10 01:18:30.485  INFO 27505 --- [           main] o.elasticsearch.plugins.PluginsService   : loaded plugin [org.elasticsearch.script.mustache.MustachePlugin]
2019-09-10 01:18:30.485  INFO 27505 --- [           main] o.elasticsearch.plugins.PluginsService   : loaded plugin [org.elasticsearch.transport.Netty4Plugin]

Note the delay of more than 20 minutes (00:55:59 to 01:18:30) between the second and third plugin-loading blocks.

When I curl the clusters from the test environment, they are all reachable and respond without delay.

What could cause the delay, and where should I look?

Is it possible, or even recommended, to load the Elasticsearch plugins only once for all three clusters? If so, how can I achieve this?

EDIT:

DEBUG logs show that the master nodes can't connect to the data nodes:

org.elasticsearch.transport.ConnectTransportException: [data_node_6][<ip-of-data-node>:9300] connect_exception
[...]

2019-09-10 18:49:00.517 DEBUG 26219 --- [main] o.e.c.t.TransportClientNodesService      : failed to connect to discovered node [{data_node_6}{LKdxInfLSyqrGgSOXvTwFw}{YIhin3kpSNupEY1jBlHFVg}{<ip-of-data-node>}{<ip-of-data-node>:9300}{ml.machine_memory=33422729216, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}]

But my cluster is online and in a green state with all data nodes present. All nodes are configured as a mesh VPN with ports 9200 and 9300 open for communication between the nodes.

Does ES need another port to be open for communication?
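
One way to narrow this down is to test, from the host that runs the application, whether each address in the DEBUG log accepts a TCP connection on the transport port. The following is a minimal sketch using only the JDK; the class name `TransportPortCheck` and the 3-second timeout are illustrative choices and not part of Elasticsearch or Spring.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper for illustration; not part of Elasticsearch or Spring.
class TransportPortCheck {

    /** Returns true if a TCP connection to host:port succeeds within timeoutMillis. */
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Covers refused connections, timeouts, and unresolvable hosts.
            return false;
        }
    }

    public static void main(String[] args) {
        // Each argument is expected as host:port, for example an address
        // copied from the "failed to connect to discovered node" log line.
        for (String arg : args) {
            String[] hostAndPort = arg.split(":");
            String host = hostAndPort[0];
            int port = Integer.parseInt(hostAndPort[1]);
            System.out.println(arg + " reachable: " + canConnect(host, port, 3000));
        }
    }
}
```

Running it with the master-node and data-node addresses as arguments shows which of them the application host can actually reach on port 9300. An address that is reachable from inside the cluster network is not necessarily reachable from the application host.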


Solution

  • The problem occurred because I had enabled cluster sniffing (https://www.elastic.co/guide/en/elasticsearch/client/java-api/current/transport-client.html), which discovers all data nodes in the cluster and communicates with them directly instead of going through the master nodes.

    Since my cluster sits inside a VPN and only the master nodes can be reached from the backend (which is outside the VPN), the backend cannot communicate with the data nodes: the master nodes hand it the data nodes' internal VPN IPs, which are not public, hence the connection failures.

    So I disabled cluster sniffing, and everything now works as expected.
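
For reference, a per-cluster configuration class with sniffing disabled might look like the sketch below. The original configuration classes are not shown in the question, so the class name `ClusterOneConfig` and the bean method names are hypothetical. The property keys are the ones from the application.properties above, and the code assumes the Elasticsearch 6.x transport client together with a Spring Data Elasticsearch version that provides `ElasticsearchTemplate`.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.TransportAddress;
import org.elasticsearch.transport.client.PreBuiltTransportClient;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.elasticsearch.core.ElasticsearchTemplate;

// Hypothetical configuration class for the first cluster (names are illustrative).
@Configuration
public class ClusterOneConfig {

    @Value("${spring.data.elasticsearch.cluster-one.cluster-name}")
    private String clusterName;

    // Comma-separated list of host:port entries.
    @Value("${spring.data.elasticsearch.cluster-one.cluster-nodes}")
    private String clusterNodes;

    @Bean(destroyMethod = "close")
    public TransportClient clusterOneClient() throws UnknownHostException {
        Settings settings = Settings.builder()
                .put("cluster.name", clusterName)
                // false is also the default, so removing the setting has the same effect.
                // With sniffing off, the client only talks to the addresses added below.
                .put("client.transport.sniff", false)
                .build();

        TransportClient client = new PreBuiltTransportClient(settings);
        for (String node : clusterNodes.split(",")) {
            String[] hostAndPort = node.trim().split(":");
            client.addTransportAddress(new TransportAddress(
                    InetAddress.getByName(hostAndPort[0]),
                    Integer.parseInt(hostAndPort[1])));
        }
        return client;
    }

    @Bean
    public ElasticsearchTemplate clusterOneTemplate() throws UnknownHostException {
        return new ElasticsearchTemplate(clusterOneClient());
    }
}
```

With this setup, every request goes through the configured master-node addresses, so the internal data-node IPs are never contacted from outside the VPN.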