mesos marathon dcos

Mesos marathon cannot destroy job


I have a DC/OS cluster that is running a website. The website runs on 20 Docker instances. When I look at my application, however, I see 24 instances: 2 have status "started" but health "unknown", and 2 have status "staged". The old instances were from a previous deploy. I tried the following things:

  • destroy the application (result: Error destroying /azure-tracking-api: Futures timed out after [10000 milliseconds])
  • kill all instances (result: they all restart)

In the log I don't see any major errors except:

Cannot kill task azure-tracking-api.908a6c3e-8948-11e6-be5a-7e591cfeda59 of framework 517c75b9-0a13-4b3b-a29d-8d754239991b-0000 (marathon) at [email protected]:42546 because it is unknown; performing reconciliation

The version I'm using is 0.28.1.

My question is: can I fix this with a couple of commands? The only way I know to fix this is to set up a new cluster.


Solution

  • The Marathon version you're using (1.1.2) has known issues with lost tasks. Once DC/OS 1.8 is available on Azure, the best option is to upgrade. As a workaround for now, you can manually delete a task using Marathon's HTTP API:

    $ curl -X DELETE "$MARATHON_URL/v2/apps/azure-tracking-api/tasks/$TASKID?force=true"
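
The DELETE call above needs a task ID, which the answer leaves as `$TASKID`. One way to find the IDs of the stuck tasks is to ask Marathon for the app's task list first. The following is a minimal sketch, not a definitive procedure: the helper names (`task_delete_url`, `list_task_ids`, `delete_task`) are hypothetical, and it assumes `curl` and `jq` are installed and that the first argument is the Marathon base URL without a trailing slash.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of Marathon or DC/OS.
# Assumes curl and jq are installed.

# Print the URL that force-deletes one task of an app.
# $1 = Marathon base URL, $2 = app id, $3 = task id
task_delete_url() {
    printf '%s/v2/apps/%s/tasks/%s?force=true\n' "$1" "$2" "$3"
}

# List the task IDs Marathon currently tracks for an app, one per line.
# $1 = Marathon base URL, $2 = app id
list_task_ids() {
    curl -s -H 'Accept: application/json' "$1/v2/apps/$2/tasks" \
        | jq -r '.tasks[].id'
}

# Force-delete a single task by ID.
# $1 = Marathon base URL, $2 = app id, $3 = task id
delete_task() {
    curl -s -X DELETE "$(task_delete_url "$1" "$2" "$3")"
}
```

After sourcing these functions, `list_task_ids "$MARATHON_URL" azure-tracking-api` would print the task IDs, and `delete_task "$MARATHON_URL" azure-tracking-api "$TASKID"` would issue the same request as the workaround for one of them. Quoting the URL keeps shells such as zsh from treating the `?` as a glob character.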