I have a 200-node Mesos cluster that can run around 2,700 executors concurrently. Around 10-20% of my executors are LOST at the very beginning; their logs only get as far as extracting the executor tar file:
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0617 21:35:09.947180 45885 fetcher.cpp:76] Fetching URI 'http://download_url/remote_executor.tgz'
I0617 21:35:09.947273 45885 fetcher.cpp:126] Downloading 'http://download_url/remote_executor.tgz' to '/mesos_dir/remote_executor.tgz'
I0617 21:35:57.551722 45885 fetcher.cpp:64] Extracted resource '/mesos_dir/remote_executor.tgz' into '/extracting_mesos_dir/'
My executor tarball is fairly large (around 40 MB), and most of the executors that take 30+ seconds to download get LOST. Does the Mesos master wait for executors to register within a certain time period and mark them LOST if they fail to do so?
Executor details:
I am using Python to implement both the scheduler and the executor. The executor is a Python module that extends the base class 'Executor'. I have implemented the launchTask method of the Executor class, which simply does what the executor is supposed to do.
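Roughly, the executor looks like the following trimmed-down sketch (module paths assume the mesos.interface bindings, which vary by Mesos version; MyExecutor and the task body are placeholders, not my actual code):

import sys
import threading

from mesos.interface import Executor, mesos_pb2
from mesos.native import MesosExecutorDriver


class MyExecutor(Executor):
    def launchTask(self, driver, task):
        # Run the task in a separate thread so the driver thread is not blocked.
        def run():
            update = mesos_pb2.TaskStatus()
            update.task_id.value = task.task_id.value
            update.state = mesos_pb2.TASK_RUNNING
            driver.sendStatusUpdate(update)

            # ... do the actual work here ...

            update = mesos_pb2.TaskStatus()
            update.task_id.value = task.task_id.value
            update.state = mesos_pb2.TASK_FINISHED
            driver.sendStatusUpdate(update)

        threading.Thread(target=run).start()


if __name__ == "__main__":
    driver = MesosExecutorDriver(MyExecutor())
    sys.exit(0 if driver.run() == mesos_pb2.DRIVER_STOPPED else 1)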
The executor info is:
from mesos.interface import mesos_pb2  # older bindings expose this as a top-level mesos_pb2 module

executor = mesos_pb2.ExecutorInfo()
executor.executor_id.value = "executor-%s" % (str(task_id),)
executor.command.value = 'python -m myexecutor'
# where to download the executor tarball from
tar_uri = '%s/remote_executor.tgz' % (
    self.conf.remote_executor_cache_url)
executor.command.uris.add().value = tar_uri
executor.name = 'some_executor_name'
executor.source = "executor_test"
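For context, this ExecutorInfo is attached to each TaskInfo when launching tasks from an offer, roughly like this (the offer, task_id, and resource values here are just illustrative, not my actual code):

task = mesos_pb2.TaskInfo()
task.task_id.value = str(task_id)
task.slave_id.value = offer.slave_id.value
task.name = "task %s" % (task_id,)
task.executor.MergeFrom(executor)

cpus = task.resources.add()
cpus.name = "cpus"
cpus.type = mesos_pb2.Value.SCALAR
cpus.scalar.value = 1

# newer bindings also accept a list of offer IDs here
driver.launchTasks(offer.id, [task])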
The default timeout for an executor to register with a slave is 1 minute; it can be changed with the --executor_registration_timeout slave flag.
From the Mesos Configuration docs:
--executor_registration_timeout=VALUE Amount of time to wait for an executor to register with the slave before considering it hung and shutting it down (e.g., 60secs, 3mins, etc) (default: 1mins)
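So if fetching and extracting the 40 MB tarball regularly takes close to or over a minute, raising the timeout when starting each slave should help, for example (the master address and timeout value are illustrative; slaves need to be restarted for the flag to take effect):

mesos-slave --master=<master_host>:5050 --executor_registration_timeout=5mins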