I am trying to build a custom Docker image for OpenSearch based on the official image from the OpenSearch project (version 2.16.0).
My goal is to have an image for development and QA purposes that, when deployed, supplies me with a single-node OpenSearch cluster with a number of test indices and no authentication. Data should be baked into the image.
To that end, I take the official image and, during the docker build process, start the OpenSearch cluster and push the data for my indices into OpenSearch using elasticdump. I have done the same successfully for Elasticsearch in the past.
Here is a simplified version of my Dockerfile:
FROM opensearchproject/opensearch:2.16.0
USER root
ENV NODE_TLS_REJECT_UNAUTHORIZED=0
COPY test_mapping.json /etc/test_mapping.json
COPY test_data.json /etc/test_data.json
# procps needed for opensearch --daemonize
RUN yum -y install nodejs npm procps
RUN npm install elasticdump -g
# start as single node cluster with security disabled so that we need not set up certificates
RUN printf 'discovery.type: single-node\nplugins.security.disabled: true\n' >> /usr/share/opensearch/config/opensearch.yml
# opensearch cannot be run as root
USER opensearch
RUN opensearch --daemonize -p /tmp/opensearch-pid \
# wait for opensearch cluster to start up
&& sleep 30s \
&& elasticdump \
--input=/etc/test_mapping.json \
--output=http://localhost:9200/test_index \
--type=mapping \
&& elasticdump \
--input=/etc/test_data.json \
--output=http://localhost:9200/test_index \
--type=data \
--limit=250 \
--concurrencyInterval=2000 \
# terminate opensearch gracefully
&& pkill -f /tmp/opensearch-pid
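(As an aside: the fixed sleep 30s is just a guess at how long startup takes. A more robust wait, which is my own untested variation and not part of the build shown above, would be to poll the REST endpoint until the cluster answers:)
RUN opensearch --daemonize -p /tmp/opensearch-pid \
    # untested variant: poll the REST endpoint until it answers instead of sleeping a fixed 30s
    && until curl -s http://localhost:9200 >/dev/null; do sleep 2; done \
    # ... elasticdump commands as above ...
    && pkill -f /tmp/opensearch-pid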
The build runs smoothly without errors and creates a Docker image. The problem occurs when I try to start a container from that image:
docker run --rm --name testopensearch -p 9200:9200 -p 9600:9600 -d mycustomopensearch:0.0.1
The endpoint doesn't come up, and docker logs testopensearch shows that OpenSearch was started but then failed with an org.apache.lucene.store.AlreadyClosedException:
...
[2024-09-18T13:25:14,668][INFO ][o.o.e.NodeEnvironment ] [215c1baf6b02] using [1] data paths, mounts [[/ (overlay)]], net usable_space [364.7gb], net total_space [914.6gb], types [overlay]
[2024-09-18T13:25:14,668][INFO ][o.o.e.NodeEnvironment ] [215c1baf6b02] heap size [1gb], compressed ordinary object pointers [true]
[2024-09-18T13:25:14,670][ERROR][o.o.b.OpenSearchUncaughtExceptionHandler] [215c1baf6b02] uncaught exception in thread [main]
org.opensearch.bootstrap.StartupException: org.apache.lucene.store.AlreadyClosedException: Underlying file changed by an external force at 2024-09-18T13:25:14.658508066Z, (lock=NativeFSLock(path=/usr/share/opensearch/data/nodes/0/node.lock,impl=sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid],creationTime=2024-09-18T13:22:22.653718919Z))
at org.opensearch.bootstrap.OpenSearch.init(OpenSearch.java:185) ~[opensearch-2.16.0.jar:2.16.0]
at org.opensearch.bootstrap.OpenSearch.execute(OpenSearch.java:172) ~[opensearch-2.16.0.jar:2.16.0]
at org.opensearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:104) ~[opensearch-2.16.0.jar:2.16.0]
at org.opensearch.cli.Command.mainWithoutErrorHandling(Command.java:138) ~[opensearch-cli-2.16.0.jar:2.16.0]
at org.opensearch.cli.Command.main(Command.java:101) ~[opensearch-cli-2.16.0.jar:2.16.0]
at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:138) ~[opensearch-2.16.0.jar:2.16.0]
at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:104) ~[opensearch-2.16.0.jar:2.16.0]
Caused by: org.apache.lucene.store.AlreadyClosedException: Underlying file changed by an external force at 2024-09-18T13:25:14.658508066Z, (lock=NativeFSLock(path=/usr/share/opensearch/data/nodes/0/node.lock,impl=sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid],creationTime=2024-09-18T13:22:22.653718919Z))
at org.apache.lucene.store.NativeFSLockFactory$NativeFSLock.ensureValid(NativeFSLockFactory.java:179) ~[lucene-core-9.11.1.jar:9.11.1 0c087dfdd10e0f6f3f6faecc6af4415e671a9e69 - 2024-06-23 12:31:02]
at org.opensearch.env.NodeEnvironment.assertEnvIsLocked(NodeEnvironment.java:1149) ~[opensearch-2.16.0.jar:2.16.0]
at org.opensearch.env.NodeEnvironment.nodeDataPaths(NodeEnvironment.java:900) ~[opensearch-2.16.0.jar:2.16.0]
at org.opensearch.env.NodeEnvironment.assertCanWrite(NodeEnvironment.java:1373) ~[opensearch-2.16.0.jar:2.16.0]
at org.opensearch.env.NodeEnvironment.<init>(NodeEnvironment.java:376) ~[opensearch-2.16.0.jar:2.16.0]
at org.opensearch.env.NodeEnvironment.<init>(NodeEnvironment.java:301) ~[opensearch-2.16.0.jar:2.16.0]
at org.opensearch.node.Node.<init>(Node.java:550) ~[opensearch-2.16.0.jar:2.16.0]
at org.opensearch.node.Node.<init>(Node.java:432) ~[opensearch-2.16.0.jar:2.16.0]
at org.opensearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:242) ~[opensearch-2.16.0.jar:2.16.0]
at org.opensearch.bootstrap.Bootstrap.setup(Bootstrap.java:242) ~[opensearch-2.16.0.jar:2.16.0]
at org.opensearch.bootstrap.Bootstrap.init(Bootstrap.java:404) ~[opensearch-2.16.0.jar:2.16.0]
uncaught exception in thread [main]
at org.opensearch.bootstrap.OpenSearch.init(OpenSearch.java:181) ~[opensearch-2.16.0.jar:2.16.0]
... 6 more
org.apache.lucene.store.AlreadyClosedException: Underlying file changed by an external force at 2024-09-18T13:25:14.658508066Z, (lock=NativeFSLock(path=/usr/share/opensearch/data/nodes/0/node.lock,impl=sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid],creationTime=2024-09-18T13:22:22.653718919Z))
at org.apache.lucene.store.NativeFSLockFactory$NativeFSLock.ensureValid(NativeFSLockFactory.java:179)
at org.opensearch.env.NodeEnvironment.assertEnvIsLocked(NodeEnvironment.java:1149)
at org.opensearch.env.NodeEnvironment.nodeDataPaths(NodeEnvironment.java:900)
at org.opensearch.env.NodeEnvironment.assertCanWrite(NodeEnvironment.java:1373)
at org.opensearch.env.NodeEnvironment.<init>(NodeEnvironment.java:376)
at org.opensearch.env.NodeEnvironment.<init>(NodeEnvironment.java:301)
at org.opensearch.node.Node.<init>(Node.java:550)
at org.opensearch.node.Node.<init>(Node.java:432)
at org.opensearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:242)
at org.opensearch.bootstrap.Bootstrap.setup(Bootstrap.java:242)
at org.opensearch.bootstrap.Bootstrap.init(Bootstrap.java:404)
at org.opensearch.bootstrap.OpenSearch.init(OpenSearch.java:181)
at org.opensearch.bootstrap.OpenSearch.execute(OpenSearch.java:172)
at org.opensearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:104)
at org.opensearch.cli.Command.mainWithoutErrorHandling(Command.java:138)
at org.opensearch.cli.Command.main(Command.java:101)
at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:138)
at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:104)
For complete error details, refer to the log at /usr/share/opensearch/logs/docker-cluster.log
My theory is that the opensearch --daemonize call during the build somehow does not terminate correctly and leaves something behind that causes this error. I've already tried adding an additional && sleep 30s after the pkill to give OpenSearch more time to shut down, but to no effect.
I tried to take a look at the logs, but contrary to what the error message says, there is no file /usr/share/opensearch/logs/docker-cluster.log. I could not find out where an opensearch -d run writes its logs.
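One thing that might help with debugging (an idea I have not actually verified) is a throwaway build step right after the daemonized run that dumps whatever ended up in the default log directory:
# hypothetical debug step: show whatever the daemonized run wrote to the default log directory
RUN ls -l /usr/share/opensearch/logs && tail -n 100 /usr/share/opensearch/logs/*.log || true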
I also tried to start the image and run opensearch interactively from there:
docker run --rm --name testopensearch -p 9200:9200 -p 9600:9600 -it --entrypoint=/bin/bash mycustomopensearch:0.0.1
When I run opensearch -d from within the container, I get the same error. Interestingly, when I then terminate it with pkill and repeat the same commands, I get the same error the second time, but on the third attempt opensearch starts correctly and the cluster comes up with my indices in place and everything. This leaves me puzzled.
I also checked whether elasticdump might somehow be involved, but this part works fine and is not the root of the problem. If we replace the elasticdump commands with a simple curl to the OpenSearch root, the build also works fine and the problem afterwards is the same (so the following is an even more minimal example, in case someone would like to reproduce the issue):
FROM opensearchproject/opensearch:2.16.0
USER root
ENV NODE_TLS_REJECT_UNAUTHORIZED=0
# procps needed for opensearch --daemonize
RUN yum -y install procps
# start as single node cluster with security disabled so that we need not set up certificates
RUN printf 'discovery.type: single-node\nplugins.security.disabled: true\n' >> /usr/share/opensearch/config/opensearch.yml
# opensearch cannot be run as root
USER opensearch
RUN opensearch --daemonize -p /tmp/opensearch-pid \
# wait for opensearch cluster to start up
&& sleep 30s \
&& curl http://localhost:9200 \
# terminate opensearch gracefully
&& pkill -f /tmp/opensearch-pid
I also checked whether the way opensearch is started when deploying the image might play a role, but if I introduce an additional RUN opensearch after the first RUN, this triggers the org.apache.lucene.store.AlreadyClosedException during the build as well.
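In other words, a fragment like this (sketched from the experiment just described) already aborts at build time:
RUN opensearch --daemonize -p /tmp/opensearch-pid \
    && sleep 30s \
    && curl http://localhost:9200 \
    && pkill -f /tmp/opensearch-pid
# this second start attempt already fails with the AlreadyClosedException during the build
RUN opensearch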
Can anyone point me to a way to terminate opensearch -d gracefully during the build that avoids this issue? Or is there a lock file I can delete manually, or something similar? Any help would be greatly appreciated.
OK, I finally found a way: I manually deleted two lock files that seem to have been left behind during the build and that apparently caused the issue.
RUN rm /usr/share/opensearch/data/nodes/0/node.lock \
&& rm /usr/share/opensearch/data/nodes/0/_state/write.lock
This seems like a harsh solution to me, but I could not find a way to have opensearch -d shut down gracefully. I hope there will be no side effects. I also needed to switch users because of access rights. Full working example Dockerfile:
FROM opensearchproject/opensearch:2.16.0
USER root
ENV NODE_TLS_REJECT_UNAUTHORIZED=0
# procps needed for opensearch --daemonize
RUN yum -y install procps
# start as single node cluster with security disabled so that we need not set up certificates
RUN printf 'discovery.type: single-node\nplugins.security.disabled: true\n' >> /usr/share/opensearch/config/opensearch.yml
# opensearch cannot be run as root
USER opensearch
RUN opensearch --daemonize -p /tmp/opensearch-pid \
# wait for opensearch cluster to start up
&& sleep 30s \
&& curl http://localhost:9200 \
# terminate opensearch gracefully
&& pkill -f /tmp/opensearch-pid
USER root
# dirty workaround to resolve the problem where opensearch won't start because it did not shut down cleanly before
RUN rm /usr/share/opensearch/data/nodes/0/node.lock \
&& rm /usr/share/opensearch/data/nodes/0/_state/write.lock
USER opensearch
I am still wondering whether there is something else I could have done, though, so if anyone knows, feel free to comment or answer...
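One idea I have not tried yet would be to stop the daemon via the PID file and have the build step wait until the process has really exited, instead of removing the lock files afterwards. Something along these lines (untested sketch, and since the extra sleep after pkill did not help, I am not sure it would behave any differently):
RUN opensearch --daemonize -p /tmp/opensearch-pid \
    # wait for opensearch cluster to start up
    && sleep 30s \
    && curl http://localhost:9200 \
    # stop the daemon via its PID and wait until the process is really gone
    # before the RUN step (and thus the image layer) is finalized
    && OS_PID="$(cat /tmp/opensearch-pid)" \
    && kill "$OS_PID" \
    && while kill -0 "$OS_PID" 2>/dev/null; do sleep 1; done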