How can I find out why Mesos Chronos job fails?


I used to use cron for my backup routine and everything was fine:

tar c --exclude=owncloud --exclude=hadoop -C /var/opt . | pigz -c -p 4 --best | hadoop fs -put - /apps/appBackups/myserver_var_opt_$(date +\%Y-\%m-\%d_\%H-\%M-\%S).tar.gz

After I moved it to Mesos Chronos, it started failing intermittently, even when I force-run it:

ssh root@myserver <<'ENDSSH'
bash daily_opt_backup.sh
ENDSSH

The mesos-master.INFO logs are not descriptive enough: they show a task's state changes (TASK_RUNNING, ACKNOWLEDGE call, TASK_FINISHED, and UUIDs) but not the reason why the task failed. Where can I find this information?


Solution

  • The job fails because some slaves do not have the private key needed to log in as root. The proper way is to put the script in HDFS so that every mesos-slave can copy and run it:

    hadoop fs -get /apps/utils/daily_opt_backup.sh && chmod +x daily_opt_backup.sh && ./daily_opt_backup.sh
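
On the original question of where the failure reason ends up: Mesos normally keeps each task's stdout and stderr files in the task's sandbox directory on the slave that ran it, so anything the job prints to stderr should be readable there. The following is only a sketch of how the command chain above could be wrapped so that the failing step and its exit code are printed. The helper names `run_step` and `fetch_and_run` are hypothetical and not part of Mesos, Chronos, or Hadoop.

```shell
#!/bin/sh
# Hypothetical helper: run a command and, if it fails, report the
# command line and its exit code on stderr before returning that code.
run_step() {
    "$@" || {
        rc=$?
        echo "step failed (exit $rc): $*" >&2
        return "$rc"
    }
}

# Hypothetical wrapper around the commands from the answer above.
# The chain stops at the first step that fails.
fetch_and_run() {
    run_step hadoop fs -get /apps/utils/daily_opt_backup.sh &&
    run_step chmod +x daily_opt_backup.sh &&
    run_step ./daily_opt_backup.sh
}
```

Calling `fetch_and_run` as the last line of the job command would then leave a line such as "step failed (exit 1): ..." in the task's stderr whenever the fetch, the chmod, or the backup itself fails.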