Destroying a Docker container from Marathon kills the Mesos slave


We run a Mesos cluster and use Marathon to launch tasks in Docker containers on the Mesos slaves.

The system generally runs well, but a strange problem occurs from time to time: when we destroy or re-deploy a task through Marathon, the mesos-slave process is killed as the target Docker container exits. This is the error log:

Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.465544  4094 docker.cpp:1592] Executor for container 'eadfb756-b653-42eb-977a-c16c78b1a7c5' has exited
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.465736  4094 docker.cpp:1390] Destroying container 'eadfb756-b653-42eb-977a-c16c78b1a7c5'
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.465812  4094 docker.cpp:1494] Running docker stop on container 'eadfb756-b653-42eb-977a-c16c78b1a7c5'
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.466089  4098 slave.cpp:3440] Executor 'prod-xxxxxxx-data-collector-writer.6d832d68-d519-11e5-acca-00505692154c' of framework 8d26b713-c3cd-4e9b-956d-24f63b1320e0-0000 exited with status 0
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.466167  4098 slave.cpp:3544] Cleaning up executor 'prod-xxxxxxx-data-collector-writer.6d832d68-d519-11e5-acca-00505692154c' of framework 8d26b713-c3cd-4e9b-956d-24f63b1320e0-0000
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: F0229 19:31:51.470055  4098 slave.cpp:3570] CHECK_SOME(os::touch(path)): Failed to open file: No such file or directory
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: *** Check failure stack trace: ***
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @     0x7f8c3c2144dd  google::LogMessage::Fail()
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @     0x7f8c3c21621c  google::LogMessage::SendToLog()
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.566812  4099 docker.cpp:1592] Executor for container 'e2d9c750-88b7-4247-b696-6589665d6a66' has exited
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @     0x7f8c3c2140cc  google::LogMessage::Flush()
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.569646  4099 docker.cpp:1390] Destroying container 'e2d9c750-88b7-4247-b696-6589665d6a66'
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.569757  4099 docker.cpp:1592] Executor for container 'f51c68b8-c64d-47ea-a629-8516dcc90dba' has exited
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.569787  4099 docker.cpp:1390] Destroying container 'f51c68b8-c64d-47ea-a629-8516dcc90dba'
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.569818  4099 docker.cpp:1494] Running docker stop on container 'e2d9c750-88b7-4247-b696-6589665d6a66'
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.569849  4099 docker.cpp:1494] Running docker stop on container 'f51c68b8-c64d-47ea-a629-8516dcc90dba'
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @     0x7f8c3c216b19  google::LogMessageFatal::~LogMessageFatal()
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @     0x7f8c3bc99f2e  mesos::internal::slave::Slave::removeExecutor()
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @     0x7f8c3bcaca60  mesos::internal::slave::Slave::executorTerminated()
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @     0x7f8c3c1c6541  process::ProcessManager::resume()
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @     0x7f8c3c1c683f  process::internal::schedule()
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @     0x7f8c3ad4a1e0  (unknown)
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @     0x7f8c3afa3df5  start_thread
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @     0x7f8c3a7b41ad  __clone
Feb 29 19:31:51 mesos-slave3.ourcompany.com systemd[1]: mesos-slave.service: main process exited, code=killed, status=6/ABRT
Feb 29 19:31:51 mesos-slave3.ourcompany.com systemd[1]: Unit mesos-slave.service entered failed state.
Feb 29 19:32:11 mesos-slave3.ourcompany.com systemd[1]: mesos-slave.service holdoff time over, scheduling restart.
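
Because the stack frames are interleaved with INFO lines from other threads, the output above is easier to read once the fatal line and the frames are separated out. Below is a minimal Python sketch for doing that, assuming the journald prefix and glog layout seen in this log. The regular expressions and the function name summarize are illustrative and not part of Mesos.

```python
import re

# glog line after the journald prefix:
# severity letter, MMDD, time, thread id, "file:line]", message.
GLOG = re.compile(
    r"mesos-slave\[\d+\]: (?P<sev>[IWEF])(?P<date>\d{4}) (?P<time>[\d:.]+)\s+"
    r"(?P<tid>\d+) (?P<src>[\w.]+:\d+)\] (?P<msg>.*)$"
)
# Stack frame line, e.g. "@     0x7f8c3c2144dd  google::LogMessage::Fail()".
FRAME = re.compile(r"mesos-slave\[\d+\]: @\s+0x[0-9a-f]+\s+(?P<sym>.+)$")

# A few lines copied from the log above, used only as sample input.
SAMPLE = """\
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.465812  4094 docker.cpp:1494] Running docker stop on container 'eadfb756-b653-42eb-977a-c16c78b1a7c5'
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: F0229 19:31:51.470055  4098 slave.cpp:3570] CHECK_SOME(os::touch(path)): Failed to open file: No such file or directory
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: *** Check failure stack trace: ***
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @     0x7f8c3c2144dd  google::LogMessage::Fail()
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.566812  4099 docker.cpp:1592] Executor for container 'e2d9c750-88b7-4247-b696-6589665d6a66' has exited
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @     0x7f8c3bc99f2e  mesos::internal::slave::Slave::removeExecutor()
Feb 29 19:31:51 mesos-slave3.ourcompany.com systemd[1]: mesos-slave.service: main process exited, code=killed, status=6/ABRT
"""


def summarize(lines):
    """Return (fatal_messages, stack_symbols) from journald-prefixed glog output."""
    fatals, frames = [], []
    for line in lines:
        entry = GLOG.search(line)
        if entry:
            # Keep only FATAL entries; INFO/WARNING/ERROR lines are skipped.
            if entry.group("sev") == "F":
                fatals.append((entry.group("src"), entry.group("msg")))
            continue
        frame = FRAME.search(line)
        if frame:
            frames.append(frame.group("sym").strip())
    return fatals, frames


if __name__ == "__main__":
    fatal_entries, stack = summarize(SAMPLE.splitlines())
    for src, msg in fatal_entries:
        print("FATAL at", src, "->", msg)
    for symbol in stack:
        print("  frame:", symbol)
```

Feeding it the full journal output (for example, lines read from a saved log file) should list the single fatal check followed by the frames in call order, which makes it clearer that the abort originates in Slave::removeExecutor().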

The task launched in the Docker container is an Akka application. The environment for the whole system is:

OS:

CentOS Linux release 7.1.1503 (Core)

Kernel:

3.10.0-229.el7.x86_64

JDK on all machines:

java version "1.7.0_91"
OpenJDK Runtime Environment (rhel-2.6.2.1.el7_1-x86_64 u91-b00)
OpenJDK 64-Bit Server VM (build 24.91-b01, mixed mode)

Mesos:

0.25, installed with yum from the Mesosphere repo

Mesos-Master config:

--zk=zk://zk-node1:2181,zk-node2:2181,zk-node3:2181,zk-node4:2181,zk-node5:2181/mesos-cluster --port=5050 --log_dir=/var/log/mesos --cluster=mesos-prod-cluster --hostname=<real hostname> --ip=<real ip> --quorum=3 --registry_fetch_timeout=5mins --work_dir=/var/lib/mesos

Mesos-Slave config:

--master=zk://zk-node1:2181,zk-node2:2181,zk-node3:2181,zk-node4:2181,zk-node5:2181/mesos-cluster --log_dir=/var/log/mesos --attributes=env:prod --containerizers=docker,mesos --docker_remove_delay=2weeks --executor_registration_timeout=30mins --hostname=<real slave hostname>

Marathon info:

{
"name": "marathon",
"version": "0.11.1",
"elected": true,
"leader": "<leader_ip>:8080",
"frameworkId": "8d26b713-c3cd-4e9b-956d-24f63b1320e0-0000",
"marathon_config": {
    "master": "zk-node1:2181,zk-node2:2181,zk-node3:2181,zk-node4:2181,zk-node5:2181/mesos-cluster",
    "failover_timeout": 604800,
    "framework_name": "marathon",
    "ha": true,
    "checkpoint": true,
    "local_port_min": 10000,
    "local_port_max": 20000,
    "executor": "//cmd",
    "hostname": "<hostname>",
    "webui_url": null,
    "mesos_role": null,
    "task_launch_timeout": 600000,
    "reconciliation_initial_delay": 15000,
    "reconciliation_interval": 300000,
    "marathon_store_timeout": 2000,
    "mesos_user": "root",
    "leader_proxy_connection_timeout_ms": 5000,
    "leader_proxy_read_timeout_ms": 10000,
    "mesos_leader_ui_url": "http://<leader_ip>:5050/"
},
"zookeeper_config": {
    "zk": "zk-node1:2181,zk-node2:2181,zk-node3:2181,zk-node4:2181,zk-node5:2181/marathon-cluster",
    "zk_timeout": 10000,
    "zk_session_timeout": 1800000,
    "zk_max_versions": 25
},
"event_subscriber": {
    "type": "http_callback",
    "http_endpoints": null
},
"http_config": {
    "assets_path": null,
    "http_port": 8080,
    "https_port": 8443
}
}

Docker version:

Client:
 Version:      1.9.1
 API version:  1.21
 Go version:   go1.4.2
 Git commit:   a34a1d5
 Built:        Fri Nov 20 13:25:01 UTC 2015
 OS/Arch:      linux/amd64

Server:
 Version:      1.9.1
 API version:  1.21
 Go version:   go1.4.2
 Git commit:   a34a1d5
 Built:        Fri Nov 20 13:25:01 UTC 2015
 OS/Arch:      linux/amd64

Docker info:

Containers: 330
Images: 509
Server Version: 1.9.1
Storage Driver: devicemapper
 Pool Name: docker-253:0-68977907-pool
 Pool Blocksize: 65.54 kB
 Base Device Size: 107.4 GB
 Backing Filesystem:
 Data file: /dev/loop0
 Metadata file: /dev/loop1
 Data Space Used: 23.68 GB
 Data Space Total: 107.4 GB
 Data Space Available: 27.51 GB
 Metadata Space Used: 63.75 MB
 Metadata Space Total: 2.147 GB
 Metadata Space Available: 2.084 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Deferred Deletion Enabled: false
 Deferred Deleted Device Count: 0
 Data loop file: /var/lib/docker/devicemapper/devicemapper/data
 Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
 Library Version: 1.02.93-RHEL7 (2015-01-28)
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 3.10.0-229.el7.x86_64
Operating System: CentOS Linux 7 (Core)
CPUs: 4
Total Memory: 15.67 GiB
Name: mesos-slave3.gz.yougola.com
ID: QB4G:C2HK:CBPR:G5ID:6OCU:DFEC:USBP:ECLQ:FWOQ:ZGHS:JIU5:JNN4

All services, including Docker, Mesos-Master, Mesos-Slave, and Marathon, are managed by systemd.


Solution

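  • That is strange and unfortunate. The slave appears to be failing this check: https://github.com/apache/mesos/blob/0.25.0/src/slave/slave.cpp#L3570. The os::touch call there could not create the executor sentinel file because its path did not exist ("No such file or directory"). A failed CHECK aborts the process, which matches the status=6/ABRT that systemd reports in your log.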

    Could you please file a new JIRA at https://issues.apache.org/jira/browse/MESOS so we can track and resolve this issue for you?
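
    To illustrate the failure mode only (not why the path went missing), here is a minimal Python sketch of a touch that succeeds while the parent directory exists and fails with ENOENT once it is gone. The directory layout, the file name executor.sentinel, and the helper names are placeholders for this sketch, and the touch helper only approximates what os::touch does.

```python
import errno
import os
import shutil
import tempfile


def touch(path):
    """Create the file if it is missing and update its timestamps.

    Like a plain touch, this does not create missing parent directories.
    """
    with open(path, "a"):
        os.utime(path, None)


def try_touch_sentinel(executor_run_dir):
    """Try to write a sentinel file; return None on success or the errno name."""
    # "executor.sentinel" is a placeholder file name for this sketch.
    sentinel = os.path.join(executor_run_dir, "executor.sentinel")
    try:
        touch(sentinel)
        return None
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))


def demo():
    """Touch once with the directory present, then again after removing it."""
    with tempfile.TemporaryDirectory() as root:
        # Hypothetical layout standing in for the slave's metadata directory.
        run_dir = os.path.join(root, "meta", "executors", "example", "runs", "latest")
        os.makedirs(run_dir)
        while_present = try_touch_sentinel(run_dir)

        # Simulate the directory disappearing before executor cleanup runs.
        shutil.rmtree(os.path.join(root, "meta"))
        after_removal = try_touch_sentinel(run_dir)
        return while_present, after_removal


if __name__ == "__main__":
    print(demo())
```

    If the slave behaves the same way, anything that removes its metadata directory while an executor is still running would trip the check at cleanup time. Before filing, it may be worth checking whether something on the host deletes files under the slave's work directory.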