We run a Mesos cluster and launch tasks through Marathon; the tasks run in Docker containers on the Mesos slaves.
The system generally runs well, but a strange problem occurs from time to time: when we destroy or re-deploy a task through Marathon, the mesos-slave process is killed as the target Docker container exits. This is the error log I got:
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.465544 4094 docker.cpp:1592] Executor for container 'eadfb756-b653-42eb-977a-c16c78b1a7c5' has exited
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.465736 4094 docker.cpp:1390] Destroying container 'eadfb756-b653-42eb-977a-c16c78b1a7c5'
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.465812 4094 docker.cpp:1494] Running docker stop on container 'eadfb756-b653-42eb-977a-c16c78b1a7c5'
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.466089 4098 slave.cpp:3440] Executor 'prod-xxxxxxx-data-collector-writer.6d832d68-d519-11e5-acca-00505692154c' of framework 8d26b713-c3cd-4e9b-956d-24f63b1320e0-0000 exited with status 0
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.466167 4098 slave.cpp:3544] Cleaning up executor 'prod-xxxxxxx-data-collector-writer.6d832d68-d519-11e5-acca-00505692154c' of framework 8d26b713-c3cd-4e9b-956d-24f63b1320e0-0000
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: F0229 19:31:51.470055 4098 slave.cpp:3570] CHECK_SOME(os::touch(path)): Failed to open file: No such file or directory
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: *** Check failure stack trace: ***
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @ 0x7f8c3c2144dd google::LogMessage::Fail()
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @ 0x7f8c3c21621c google::LogMessage::SendToLog()
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.566812 4099 docker.cpp:1592] Executor for container 'e2d9c750-88b7-4247-b696-6589665d6a66' has exited
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @ 0x7f8c3c2140cc google::LogMessage::Flush()
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.569646 4099 docker.cpp:1390] Destroying container 'e2d9c750-88b7-4247-b696-6589665d6a66'
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.569757 4099 docker.cpp:1592] Executor for container 'f51c68b8-c64d-47ea-a629-8516dcc90dba' has exited
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.569787 4099 docker.cpp:1390] Destroying container 'f51c68b8-c64d-47ea-a629-8516dcc90dba'
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.569818 4099 docker.cpp:1494] Running docker stop on container 'e2d9c750-88b7-4247-b696-6589665d6a66'
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.569849 4099 docker.cpp:1494] Running docker stop on container 'f51c68b8-c64d-47ea-a629-8516dcc90dba'
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @ 0x7f8c3c216b19 google::LogMessageFatal::~LogMessageFatal()
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @ 0x7f8c3bc99f2e mesos::internal::slave::Slave::removeExecutor()
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @ 0x7f8c3bcaca60 mesos::internal::slave::Slave::executorTerminated()
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @ 0x7f8c3c1c6541 process::ProcessManager::resume()
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @ 0x7f8c3c1c683f process::internal::schedule()
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @ 0x7f8c3ad4a1e0 (unknown)
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @ 0x7f8c3afa3df5 start_thread
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @ 0x7f8c3a7b41ad __clone
Feb 29 19:31:51 mesos-slave3.ourcompany.com systemd[1]: mesos-slave.service: main process exited, code=killed, status=6/ABRT
Feb 29 19:31:51 mesos-slave3.ourcompany.com systemd[1]: Unit mesos-slave.service entered failed state.
Feb 29 19:32:11 mesos-slave3.ourcompany.com systemd[1]: mesos-slave.service holdoff time over, scheduling restart.
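Because the stack trace is interleaved with ordinary INFO lines from other threads, it can help to filter the journal output down to the fatal entries. The following is a minimal sketch, assuming the standard glog line prefix (severity letter, month and day, time, thread id, source file and line); the function name `fatal_entries` and the `sample` list are made up for illustration, and the two sample lines are copied from the log above.

```python
import re

# glog prefix: severity letter (I/W/E/F), mmdd, hh:mm:ss.uuuuuu,
# thread id, then "file:line]" followed by the message.
GLOG_RE = re.compile(
    r"(?P<severity>[IWEF])(?P<date>\d{4}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<thread>\d+)\s+(?P<source>[^\s:]+:\d+)\] (?P<message>.*)"
)


def fatal_entries(lines):
    """Return (source, message) for every glog line with severity F."""
    found = []
    for line in lines:
        match = GLOG_RE.search(line)
        if match and match.group("severity") == "F":
            found.append((match.group("source"), match.group("message")))
    return found


# Two lines copied from the journal output above.
sample = [
    "Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: "
    "I0229 19:31:51.465812 4094 docker.cpp:1494] Running docker stop on "
    "container 'eadfb756-b653-42eb-977a-c16c78b1a7c5'",
    "Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: "
    "F0229 19:31:51.470055 4098 slave.cpp:3570] CHECK_SOME(os::touch(path)): "
    "Failed to open file: No such file or directory",
]

for source, message in fatal_entries(sample):
    print(source, "->", message)
```

On this log the only fatal entry is the CHECK_SOME failure at slave.cpp:3570; the stack-trace lines that follow it carry no glog prefix, so they are skipped.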
The task launched in the Docker container is an Akka application. The environment for the whole system is:
OS:
CentOS Linux release 7.1.1503 (Core)
Kernel:
3.10.0-229.el7.x86_64
JDK on all machines:
java version "1.7.0_91"
OpenJDK Runtime Environment (rhel-2.6.2.1.el7_1-x86_64 u91-b00)
OpenJDK 64-Bit Server VM (build 24.91-b01, mixed mode)
Mesos:
0.25, installed by yum from mesosphere repo
Mesos-Master config:
--zk=zk://zk-node1:2181,zk-node2:2181,zk-node3:2181,zk-node4:2181,zk-node5:2181/mesos-cluster --port=5050 --log_dir=/var/log/mesos --cluster=mesos-prod-cluster --hostname=<real hostname> --ip=<real ip> --quorum=3 --registry_fetch_timeout=5mins --work_dir=/var/lib/mesos
Mesos-Slave config:
--master=zk://zk-node1:2181,zk-node2:2181,zk-node3:2181,zk-node4:2181,zk-node5:2181/mesos-cluster --log_dir=/var/log/mesos --attributes=env:prod --containerizers=docker,mesos --docker_remove_delay=2weeks --executor_registration_timeout=30mins --hostname=<real slave hostname>
Marathon info:
{
"name": "marathon",
"version": "0.11.1",
"elected": true,
"leader": "<leader_ip>:8080",
"frameworkId": "8d26b713-c3cd-4e9b-956d-24f63b1320e0-0000",
"marathon_config": {
"master": "zk-node1:2181,zk-node2:2181,zk-node3:2181,zk-node4:2181,zk-node5:2181/mesos-cluster",
"failover_timeout": 604800,
"framework_name": "marathon",
"ha": true,
"checkpoint": true,
"local_port_min": 10000,
"local_port_max": 20000,
"executor": "//cmd",
"hostname": "<hostname>",
"webui_url": null,
"mesos_role": null,
"task_launch_timeout": 600000,
"reconciliation_initial_delay": 15000,
"reconciliation_interval": 300000,
"marathon_store_timeout": 2000,
"mesos_user": "root",
"leader_proxy_connection_timeout_ms": 5000,
"leader_proxy_read_timeout_ms": 10000,
"mesos_leader_ui_url": "http://<leader_ip>:5050/"
},
"zookeeper_config": {
"zk": "zk-node1:2181,zk-node2:2181,zk-node3:2181,zk-node4:2181,zk-node5:2181/marathon-cluster",
"zk_timeout": 10000,
"zk_session_timeout": 1800000,
"zk_max_versions": 25
},
"event_subscriber": {
"type": "http_callback",
"http_endpoints": null
},
"http_config": {
"assets_path": null,
"http_port": 8080,
"https_port": 8443
}
}
Docker version:
Client:
Version: 1.9.1
API version: 1.21
Go version: go1.4.2
Git commit: a34a1d5
Built: Fri Nov 20 13:25:01 UTC 2015
OS/Arch: linux/amd64
Server:
Version: 1.9.1
API version: 1.21
Go version: go1.4.2
Git commit: a34a1d5
Built: Fri Nov 20 13:25:01 UTC 2015
OS/Arch: linux/amd64
Docker info:
Containers: 330
Images: 509
Server Version: 1.9.1
Storage Driver: devicemapper
Pool Name: docker-253:0-68977907-pool
Pool Blocksize: 65.54 kB
Base Device Size: 107.4 GB
Backing Filesystem:
Data file: /dev/loop0
Metadata file: /dev/loop1
Data Space Used: 23.68 GB
Data Space Total: 107.4 GB
Data Space Available: 27.51 GB
Metadata Space Used: 63.75 MB
Metadata Space Total: 2.147 GB
Metadata Space Available: 2.084 GB
Udev Sync Supported: true
Deferred Removal Enabled: false
Deferred Deletion Enabled: false
Deferred Deleted Device Count: 0
Data loop file: /var/lib/docker/devicemapper/devicemapper/data
Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
Library Version: 1.02.93-RHEL7 (2015-01-28)
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 3.10.0-229.el7.x86_64
Operating System: CentOS Linux 7 (Core)
CPUs: 4
Total Memory: 15.67 GiB
Name: mesos-slave3.gz.yougola.com
ID: QB4G:C2HK:CBPR:G5ID:6OCU:DFEC:USBP:ECLQ:FWOQ:ZGHS:JIU5:JNN4
All services (Docker, Mesos-Master, Mesos-Slave, and Marathon) are managed by systemd.
That is strange and unfortunate. It looks like the slave is failing this check: https://github.com/apache/mesos/blob/0.25.0/src/slave/slave.cpp#L3570 because it could not find the path to the executor sentinel file.
Could you please file a new JIRA at https://issues.apache.org/jira/browse/MESOS so we can track and resolve this issue for you?
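To illustrate the kind of failure behind "Failed to open file: No such file or directory", here is a small Python sketch. It is not the Mesos implementation: `touch`, `touch_errno`, and the directory and file names are hypothetical stand-ins. It only shows that a touch-style call (open with create, then update timestamps) fails with ENOENT when the parent directory of the target file does not exist, and succeeds once the directory is there. Whether a missing executor directory is what actually happened on the slave is an assumption that the log alone does not confirm.

```python
import errno
import os
import tempfile


def touch(path):
    """Create the file if it is missing and update its timestamps."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    os.close(fd)
    os.utime(path, None)


def touch_errno(path):
    """Return 0 on success, or the errno of the failure."""
    try:
        touch(path)
    except OSError as error:
        return error.errno
    return 0


with tempfile.TemporaryDirectory() as work_dir:
    # Hypothetical layout: a per-executor directory holding a sentinel file.
    executor_dir = os.path.join(work_dir, "executor-run")
    sentinel = os.path.join(executor_dir, "sentinel")

    # The parent directory does not exist yet, so the open fails.
    missing_result = touch_errno(sentinel)

    # Once the directory exists, the same call succeeds.
    os.makedirs(executor_dir)
    present_result = touch_errno(sentinel)

print(missing_result == errno.ENOENT, present_result == 0)
```

When filing the JIRA, it may be worth noting whether the executor's directory under the slave work directory still existed at the time of the crash, since that would help confirm or rule out this explanation.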