
Fault tolerant Jenkins on DCOS


I am running a Jenkins server on DCOS as documented here https://docs.mesosphere.com/1.7/usage/tutorials/jenkins/.

The Jenkins server is able to spawn new Mesos slaves when new jobs are scheduled and kill them when the jobs complete.

However, if a cluster node crashes while a Jenkins job is running on it, the Jenkins server does not re-run the job on another available node.

Is the Jenkins service on DCOS fault tolerant? Can a job that failed because its cluster node crashed mid-execution be re-run on another available node?


Solution

  • Jenkins itself does not re-run jobs that disappear. This is not specific to DC/OS or Mesos; it is simply how Jenkins works.

    DC/OS and Mesos make sure that the Jenkins server stays running and available to accept jobs, and in that sense it is "fault tolerant". In the sense you are asking about, automatically re-running a job whose node crashed, it is not.
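
Since Jenkins will not re-run the job on its own, one possible workaround is an external watcher that checks the last build and triggers a new one. The following is only a minimal sketch, not something the DC/OS Jenkins service provides. It assumes the standard Jenkins remote API endpoints (`/job/<name>/lastBuild/api/json` and `/job/<name>/build`), a top-level job (not inside a folder), and that a build lost to a crashed node ends up as `FAILURE` or `ABORTED`. The function names, the `max_attempts` limit, and the set of retryable results are hypothetical choices; authentication and CSRF crumb handling are left out.

```python
import json
import urllib.parse
import urllib.request

# Results treated as worth retrying. This is an assumption: a build that
# loses its node may be reported differently in your setup.
RETRYABLE_RESULTS = {"FAILURE", "ABORTED"}


def should_retrigger(build_info, attempts_so_far, max_attempts=3):
    """Decide whether a build should be started again.

    build_info is the JSON document Jenkins returns for a build; only the
    "building" and "result" fields are used here.
    """
    if build_info.get("building"):
        return False  # still running, nothing to decide yet
    if attempts_so_far >= max_attempts:
        return False  # give up so a genuinely broken job does not loop forever
    return build_info.get("result") in RETRYABLE_RESULTS


def fetch_last_build(base_url, job_name):
    """Fetch the last build's JSON description from the Jenkins remote API."""
    job = urllib.parse.quote(job_name)
    url = f"{base_url}/job/{job}/lastBuild/api/json"
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def trigger_build(base_url, job_name):
    """Ask Jenkins to queue a new build of the job (POST .../build)."""
    job = urllib.parse.quote(job_name)
    request = urllib.request.Request(
        f"{base_url}/job/{job}/build", data=b"", method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

A watcher would call `fetch_last_build` periodically, pass the result to `should_retrigger`, and call `trigger_build` when it returns `True`. Note that this cannot tell a node crash apart from an ordinary test failure, so the attempt limit matters.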