configuration, mesos, mesosphere

Mesos DC/OS never allows services to start when average CPU usage is at 71%


I have a problem: we are trying to run several services on our Mesos DC/OS cluster, some of them running Spark processes and some running Python services. In our small test cluster we reach 70% CPU resources used multiple times per day.

Services that people then want to start just hang, waiting for CPU offers that could easily be met on the slave nodes but for some reason are not allowed to be allocated.

A typical example would be 7 CPUs unused in total and 1-3 services each looking for CPU offers of 0.5 to 2 CPUs. Judging by the node resource overview, these requests could be met.

So, my questions: is there a hard limit that prevents more than 70% of the CPUs from being allocated at the same time?

If so, is there a reason for this limit, and what would be the effect of changing it to a higher value?

And lastly, how do we change the limit?


Solution

  • The answer seems to be what is stated in Mesosphere's documentation for debug scenario 1: https://docs.mesosphere.com/1.11/tutorials/dcos-debug/scenarios/scen-1/

    But instead of a role problem, or us simply trying to allocate more than the cluster can handle, the problem was that some of our services are holding CPU resources as reserved.

    "reserved_resources":{"cassandra-role":{"disk":10496.0,"mem":5152.0,"gpus":0.0,"cpus":1.6,"ports":"[7000-7001, 7199-7199, 9042-9042]"},"kafka-role":{"disk":5256.0,"mem":2080.0,"gpus":0.0,"cpus":1.1,"ports":"[1025-1025]"}}

    That gives a total of 1.6 + 1.1 = 2.7 reserved CPUs, which Mesos reports as 2.81.

    Given that the slave node in this case has a maximum of 4 CPUs, the remainder should be 4 - 2.81 = 1.19, and that is the amount I can request and still get the resources.

    This was quite misleading when trying to find the answer, because the GUI only shows the used resources and not the reserved ones.

    I was able to find the answer by going through https:///mesos/state-summary

    Just to show one more thing, this is what I found for the node:

    "hostname":"1.0.1.199",
    "port":5051,
    "attributes":{},
    "pid":"slave(1)@1.0.1.199:5051",
    "registered_time":1526561517.17816,
    "reregistered_time":1526561517.17896,
    "resources":{"disk":119266.0,"mem":29476.0,"gpus":0.0,"cpus":4.0,"ports":"[1025-2180, 2182-3887, 3889-5049, 5052-8079, 8082-8180, 8182-32000]"},
    "used_resources":{"disk":15752.0,"mem":6368.0,"gpus":0.0,"cpus":1.81,"ports":"[1025-1025, 7000-7001, 7199-7199, 9042-9042]"},
    "offered_resources":{"disk":0.0,"mem":0.0,"gpus":0.0,"cpus":0.0},
    "reserved_resources":{"cassandra-role":{"disk":10496.0,"mem":5152.0,"gpus":0.0,"cpus":1.6,"ports":"[7000-7001, 7199-7199, 9042-9042]"},"kafka-role":{"disk":5256.0,"mem":2080.0,"gpus":0.0,"cpus":1.1,"ports":"[1025-1025]"}},
    "unreserved_resources":{"disk":103514.0,"mem":22244.0,"gpus":0.0,"cpus":1.3,"ports":"[1026-2180, 2182-3887, 3889-5049, 5052-6999, 7002-7198, 7200-8079, 8082-8180, 8182-9041, 9043-32000]"}

    unreserved_resources gives "cpus":1.3. I don't understand why this value is 1.3 and not 1.19, given that 1.19 is what the debug page shows as well as what I can ask for and get from server 1.0.1.199.
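
To make the numbers easier to check, here is a minimal Python sketch that recomputes the CPU breakdown from the node entry quoted above. The function name `cpu_breakdown` and the trimmed-down `AGENT_JSON` string are made up for illustration; only the `cpus` values are copied from the state-summary output, and the other fields are left out for brevity. It only does the arithmetic and does not claim to reproduce how Mesos itself decides what to offer.

```python
import json

# Node entry reduced to its "cpus" fields; values copied from the
# state-summary excerpt above (disk, mem, gpus and ports omitted).
AGENT_JSON = """
{
  "hostname": "1.0.1.199",
  "resources": {"cpus": 4.0},
  "used_resources": {"cpus": 1.81},
  "reserved_resources": {
    "cassandra-role": {"cpus": 1.6},
    "kafka-role": {"cpus": 1.1}
  },
  "unreserved_resources": {"cpus": 1.3}
}
"""


def cpu_breakdown(agent):
    """Return total, used, reserved and unreserved CPUs for one node entry."""
    total = agent["resources"]["cpus"]
    used = agent["used_resources"]["cpus"]
    # Sum the static reservations across all roles.
    reserved = sum(role["cpus"] for role in agent["reserved_resources"].values())
    return {
        "total": total,
        "used": used,
        "reserved": round(reserved, 2),
        "unreserved": round(total - reserved, 2),
    }


agent = json.loads(AGENT_JSON)
breakdown = cpu_breakdown(agent)
print(breakdown)

# Difference between the computed unreserved CPUs and the 1.19 that
# could actually be requested on this node.
gap = round(breakdown["unreserved"] - 1.19, 2)
print(gap)
```

With these inputs the two reservations sum to 2.7 CPUs, and 4.0 - 2.7 = 1.3 matches the reported `unreserved_resources` value. The 0.11 difference to the 1.19 that can actually be requested is not explained by the excerpt itself.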