
How to remove an agent from a mesos cluster?


In an IaaS context (Azure, actually), I removed a machine from our Mesos cluster without first scheduling a maintenance window to tear it down.

This agent and the tasks belonging to it now appear as "unreachable" in the UI. I tried using /maintenance/schedule and /machine/down, which worked, but the agent and its tasks still appear as "unreachable" in the UI. Is there any way to get rid of them?


Solution

  • What happens to these tasks is up to your framework. The Mesos master lost its connection to the agent, which no longer responds to health checks, so it marked the agent and all of its tasks as unreachable. A partition-aware framework should handle this situation itself. If yours is not partition-aware, you may need to wait until the tasks are marked as failed. The relevant master flag is documented as follows:

    --agent_reregister_timeout=VALUE The timeout within which an agent is expected to re-register. Agents re-register when they become disconnected from the master or when a new master is elected as the leader. Agents that do not re-register within the timeout will be marked unreachable in the registry; if/when the agent re-registers with the master, any non-partition-aware tasks running on the agent will be terminated. NOTE: This value has to be at least 10mins. (default: 10mins)
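
While waiting, it can help to see which tasks the master still reports as unreachable. The following is a minimal sketch, not an official tool: it assumes the master's `/tasks` endpoint returns a JSON object with a `tasks` list whose entries carry `id`, `slave_id`, and `state` fields, and that unreachable tasks show the state `TASK_UNREACHABLE`. The helper names, the `master_url` value, and the sample data are hypothetical; check the field names against the output of your Mesos version.

```python
import json
from collections import defaultdict
from urllib.request import urlopen


def fetch_tasks(master_url):
    """Fetch the master's /tasks endpoint.

    master_url is a placeholder such as "http://master.example:5050".
    The limit query parameter is assumed to raise the default page size.
    """
    with urlopen(f"{master_url}/tasks?limit=1000") as resp:
        return json.load(resp)


def unreachable_by_agent(tasks_response):
    """Group the IDs of tasks in TASK_UNREACHABLE by the agent they ran on."""
    grouped = defaultdict(list)
    for task in tasks_response.get("tasks", []):
        if task.get("state") == "TASK_UNREACHABLE":
            grouped[task.get("slave_id")].append(task.get("id"))
    return dict(grouped)


# Made-up sample shaped like a /tasks response, for illustration only.
sample = {
    "tasks": [
        {"id": "web.1", "slave_id": "agent-A", "state": "TASK_UNREACHABLE"},
        {"id": "web.2", "slave_id": "agent-B", "state": "TASK_RUNNING"},
        {"id": "db.1", "slave_id": "agent-A", "state": "TASK_UNREACHABLE"},
    ]
}

print(unreachable_by_agent(sample))
```

Against a live cluster you would call `unreachable_by_agent(fetch_tasks(master_url))` periodically; once the result is empty for the removed agent, the framework or the master has moved those tasks to another state.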