We are using Mesos 1.20 + Marathon 1.4.3 to run Spark jobs. I am trying to use an algorithm to forecast each job's resource usage so that we can scale up and down automatically. I can see the dynamic resource usage per framework on the Mesos web page at http://:5050/#/agents/. However, from the HTTP endpoints I can only seem to get the usage per slave, as in the question linked below:
finding active framework current resource usage in mesos
Is there any way to get a snapshot of the resource usage per framework through a Mesos endpoint?
I also tried this endpoint on a Mesos slave, but it does not appear to return CPU/memory information per framework either:
http://agent-ip:5051/metrics/snapshot/slave(1)/monitor/statistics
{
"slave/executors_terminated": 114751.0,
"slave/tasks_finished": 63594.0,
"slave/cpus_total": 8.0,
"slave/executors_preempted": 0.0,
"slave/cpus_percent": 1.0125,
"slave/executors_running": 8.0,
"slave/gpus_revocable_used": 0.0,
"slave/invalid_status_updates": 256.0,
"slave/executors_registering": 0.0,
"slave/tasks_gone": 0.0,
"slave/cpus_revocable_percent": 0.0,
"slave/gpus_total": 0.0,
"slave/tasks_killed": 50763.0,
"slave/tasks_starting": 0.0,
"slave/registered": 1.0,
"slave/gpus_revocable_total": 0.0,
....
}
Thanks
To gather this information, you need to query the /slave/monitor/statistics/ endpoint on each agent, collect the metrics for all executors, and group them by framework ID.
Here is a Diamond Mesos Collector that does this, but it collects data from a single agent only. You can aggregate the per-agent results in your metrics visualization tool, e.g. Grafana.
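If you would rather do the grouping yourself, the steps above could be sketched roughly as below. This is only a minimal sketch, not what the Diamond collector does internally. It assumes each agent returns a JSON list of executor entries carrying a `framework_id` and a `statistics` object with fields such as `cpus_limit` and `mem_rss_bytes`; check the field names and the endpoint path against your own Mesos version. The helper names `fetch_executor_stats` and `usage_by_framework` are made up for illustration.

```python
import json
from collections import defaultdict
from urllib.request import urlopen


def fetch_executor_stats(agent_url, path="/monitor/statistics", timeout=5):
    """Fetch the per-executor statistics list from one agent.

    Assumption: `path` is the statistics endpoint on your agents and it
    returns a JSON array; adjust it if your deployment differs.
    """
    with urlopen(agent_url + path, timeout=timeout) as resp:
        return json.load(resp)


def usage_by_framework(executor_entries):
    """Sum executor statistics per framework id.

    Assumption: every entry has a "framework_id" key and an optional
    "statistics" dict; missing counters are treated as zero.
    """
    totals = defaultdict(
        lambda: {"cpus_limit": 0.0, "mem_rss_bytes": 0, "executors": 0}
    )
    for entry in executor_entries:
        stats = entry.get("statistics", {})
        framework = totals[entry["framework_id"]]
        framework["cpus_limit"] += stats.get("cpus_limit", 0.0)
        framework["mem_rss_bytes"] += stats.get("mem_rss_bytes", 0)
        framework["executors"] += 1
    return dict(totals)
```

To cover the whole cluster, call `fetch_executor_stats` once per agent (for example with `http://agent-ip:5051`), concatenate the returned lists, and pass the combined list to `usage_by_framework`. Note that `cpus_limit` is the allocated limit, not the instantaneous CPU usage; for actual usage you would need to sample the CPU time counters twice and take the difference.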