
Why is Mesos framework not being offered resources?


I am using Mesos 1.0.1. I added an agent with a new role, docker_gpu_worker, and registered a framework with that role. The framework does not receive any offers, while other frameworks (same Java code) using other roles work fine. I have not restarted the three Mesos masters. Does anyone have an idea what might be going wrong?

At master/frameworks, I see my framework:

{
  "id": "fd01b1b0-eb73-4d40-8774-009171ae1db1-0701",
  "name": "/data4/Users/mikeb/jobs/999",
  "pid": "[email protected]:57617",
  "used_resources": {
    "disk": 0,
    "mem": 0,
    "gpus": 0,
    "cpus": 0
  },
  "offered_resources": {
    "disk": 0,
    "mem": 0,
    "gpus": 0,
    "cpus": 0
  },
  "capabilities": [],
  "hostname": "x-x-x-x.ec2.internal",
  "webui_url": "",
  "active": true,
  "user": "mikeb",
  "failover_timeout": 10080,
  "checkpoint": true,
  "role": "docker_gpu_worker",
  "registered_time": 1507028279.18887,
  "unregistered_time": 0,
  "principal": "test-framework-java",
  "resources": {
    "disk": 0,
    "mem": 0,
    "gpus": 0,
    "cpus": 0
  },
  "tasks": [],
  "completed_tasks": [],
  "offers": [],
  "executors": []
}

At master/roles, I see my role:

{
  "frameworks": [
    "fd01b1b0-eb73-4d40-8774-009171ae1db1-0701",
    "fd01b1b0-eb73-4d40-8774-009171ae1db1-0673",
    "fd01b1b0-eb73-4d40-8774-009171ae1db1-0335"
  ],
  "name": "docker_gpu_worker",
  "resources": {
    "cpus": 0,
    "disk": 0,
    "gpus": 0,
    "mem": 0
  },
  "weight": 1
}

At master/slaves, I see my agent:

{
  "id": "fd01b1b0-eb73-4d40-8774-009171ae1db1-S5454",
  "pid": "slave(1)@x.x.x.x:5051",
  "hostname": "x-x-x-x.ec2.internal",
  "registered_time": 1506692213.24938,
  "resources": {
    "disk": 35056,
    "mem": 59363,
    "gpus": 4,
    "cpus": 32,
    "ports": "[31000-32000]"
  },
  "used_resources": {
    "disk": 0,
    "mem": 0,
    "gpus": 0,
    "cpus": 0
  },
  "offered_resources": {
    "disk": 0,
    "mem": 0,
    "gpus": 0,
    "cpus": 0
  },
  "reserved_resources": {
    "docker_gpu_worker": {
      "disk": 35056,
      "mem": 59363,
      "gpus": 4,
      "cpus": 32,
      "ports": "[31000-32000]"
    }
  },
  "unreserved_resources": {
    "disk": 0,
    "mem": 0,
    "gpus": 0,
    "cpus": 0
  },
  "attributes": {},
  "active": true,
  "version": "1.0.1",
  "reserved_resources_full": {
    "docker_gpu_worker": [
      {
        "name": "gpus",
        "type": "SCALAR",
        "scalar": {
          "value": 4
        },
        "role": "docker_gpu_worker"
      },
      {
        "name": "cpus",
        "type": "SCALAR",
        "scalar": {
          "value": 32
        },
        "role": "docker_gpu_worker"
      },
      {
        "name": "mem",
        "type": "SCALAR",
        "scalar": {
          "value": 59363
        },
        "role": "docker_gpu_worker"
      },
      {
        "name": "disk",
        "type": "SCALAR",
        "scalar": {
          "value": 35056
        },
        "role": "docker_gpu_worker"
      },
      {
        "name": "ports",
        "type": "RANGES",
        "ranges": {
          "range": [
            {
              "begin": 31000,
              "end": 32000
            }
          ]
        },
        "role": "docker_gpu_worker"
      }
    ]
  },
  "used_resources_full": [],
  "offered_resources_full": []
}

We have tracked the problem to this Mesos agent config:

--isolation="filesystem/linux,cgroups/devices,gpu/nvidia"

With that flag removed, the agent works properly, but without access to GPU resources. According to the docs, this setting is required for Nvidia GPU support, and those docs seem to indicate that version 1.0.1 supports it. We are continuing to investigate.


Solution

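  • The framework must enable the GPU_RESOURCES capability. The master offers resources from agents that have GPUs only to frameworks that opt in this way, and the master/frameworks output in the question shows "capabilities": [] for this framework.

    As illustrated in http://mesos.readthedocs.io/en/latest/gpu-support/, the capability can be enabled, for example, by passing --framework_capabilities="GPU_RESOURCES" to the mesos-execute command, or with code like this in C++: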

    // Opt in to offers from agents that have GPUs.
    FrameworkInfo framework;
    framework.add_capabilities()->set_type(
        FrameworkInfo::Capability::GPU_RESOURCES);

    For frameworks run through Marathon, the Marathon service must instead be started with the option --enable_features "gpu_resources", as indicated in "Enable GPU resources (CUDA) on DC/OS".
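To see why a matching role is not enough, the opt-in rule can be modeled in a few lines of C++: resources on an agent that advertises GPUs go only to frameworks that list GPU_RESOURCES. This is a rough sketch of that behavior, not Mesos code. FrameworkView, AgentView, isGpuAware and mayReceiveOffer are hypothetical names invented for illustration, and the real allocator takes much more into account (unreserved resources, offer filters, fair sharing).

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical stand-ins for the fields visible in the master/frameworks and
// master/slaves output. These are NOT Mesos types.
struct FrameworkView {
  std::string role;
  std::vector<std::string> capabilities;
};

struct AgentView {
  std::string reservedRole;  // role the agent's resources are reserved for
  double gpus;               // total GPUs advertised by the agent
};

// True if the framework lists the GPU_RESOURCES capability.
bool isGpuAware(const FrameworkView& framework) {
  return std::find(framework.capabilities.begin(),
                   framework.capabilities.end(),
                   "GPU_RESOURCES") != framework.capabilities.end();
}

// Simplified model (an assumption, not the actual allocator): the role must
// match, and an agent with GPUs is skipped unless the framework opted in.
bool mayReceiveOffer(const FrameworkView& framework, const AgentView& agent) {
  const bool roleMatches =
      agent.reservedRole == "*" || agent.reservedRole == framework.role;
  const bool gpuFilterPasses = agent.gpus <= 0.0 || isGpuAware(framework);
  return roleMatches && gpuFilterPasses;
}
```

With the values from the question (role docker_gpu_worker, empty capabilities, an agent with 4 GPUs), mayReceiveOffer returns false. It returns true once "GPU_RESOURCES" is added to the capabilities, or when the agent advertises no GPUs, which is consistent with the agent working after the gpu/nvidia isolation flag was removed.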