A recent upgrade from Nomad v0.9.6 to Nomad v1.0.1 breaks a job deployment: the job ends up in a "pending" or "dead" status. Unfortunately, I couldn't get any usable information out of the Nomad agent about why. I also checked the trace monitor in the web UI, but without success.
Could you please give some advice on how to get the reject/pending reason from the agent?
I use "raw_exec" driver (non-privileged user, driver.raw_exec.enable" = "1") F or deployment I use nomad-sdk (version 0.11.3.0)
You can find the job definition (from Nomad's point of view) here:
OS details:
cat /etc/redhat-release
CentOS Linux release 7.4.1708 (Core)
uname -a
Linux blade1.lab.bulb.hr 3.10.0-693.21.1.el7.x86_64 #1 SMP Wed Mar 7 19:03:37 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Nomad agent details:
[root@blade1 ~]# nomad node-status
ID        DC   Name                Class   Drain  Eligibility  Status
5838e8b0  dc1  blade1.lab.bulb.hr  <none>  false  eligible     ready
Verbose output:
[root@blade1 ~]# nomad node-status -verbose
ID                                    DC   Name                Class   Address         Version  Drain  Eligibility  Status
5838e8b0-ebd3-5c47-a949-df3d601e0da1  dc1  blade1.lab.bulb.hr  <none>  192.168.112.31  1.0.1    false  eligible     ready
[root@blade1 ~]# nomad node-status -verbose 5838e8b0-ebd3-5c47-a949-df3d601e0da1
ID = 5838e8b0-ebd3-5c47-a949-df3d601e0da1
Name = blade1.lab.bulb.hr
Class = <none>
DC = dc1
Drain = false
Eligibility = eligible
Status = ready
CSI Controllers = <none>
CSI Drivers = <none>
Uptime = 1516h1m31s
Drivers
Driver    Detected  Healthy  Message                             Time
docker    false     false    Failed to connect to docker daemon  2020-12-18T14:37:09+01:00
exec      false     false    Driver must run as root             2020-12-18T14:37:09+01:00
java      false     false    Driver must run as root             2020-12-18T14:37:09+01:00
qemu      false     false    <none>                              2020-12-18T14:37:09+01:00
raw_exec  true      true     Healthy                             2020-12-18T14:37:09+01:00
Node Events
Time                       Subsystem  Message          Details
2020-12-18T14:37:09+01:00  Cluster    Node registered  <none>
Allocated Resources
CPU          Memory      Disk
0/18000 MHz  0 B/53 GiB  0 B/70 GiB
Allocation Resource Utilization
CPU          Memory
0/18000 MHz  0 B/53 GiB
Host Resource Utilization
CPU            Memory         Disk
499/20000 MHz  33 GiB/63 GiB  (/dev/mapper/vg00-root)
Allocations
No allocations placed
Attributes
consul.datacenter = dacs
consul.revision = 1e03567d3
consul.server = true
consul.version = 1.8.5
cpu.arch = amd64
driver.raw_exec = 1
kernel.name = linux
kernel.version = 3.10.0-693.21.1.el7.x86_64
memory.totalbytes = 67374776320
nomad.advertise.address = 192.168.112.31:5656
nomad.revision = c9c68aa55a7275f22d2338f2df53e67ebfcb9238
nomad.version = 1.0.1
os.name = centos
os.signals = SIGTTIN,SIGUSR2,SIGXCPU,SIGBUS,SIGILL,SIGQUIT,SIGCHLD,SIGIOT,SIGKILL,SIGINT,SIGSTOP,SIGSYS,SIGTTOU,SIGFPE,SIGSEGV,SIGTSTP,SIGURG,SIGWINCH,SIGCONT,SIGIO,SIGTRAP,SIGXFSZ,SIGHUP,SIGPIPE,SIGTERM,SIGPROF,SIGABRT,SIGALRM,SIGUSR1
os.version = 7.4.1708
unique.cgroup.mountpoint = /sys/fs/cgroup/systemd
unique.consul.name = grabber1
unique.hostname = blade1.lab.bulb.hr
unique.network.ip-address = 192.168.112.31
unique.storage.bytesfree = 74604830720
unique.storage.bytestotal = 126698909696
unique.storage.volume = /dev/mapper/vg00-root
Meta
connect.gateway_image = envoyproxy/envoy:v${NOMAD_envoy_version}
connect.log_level = info
connect.proxy_concurrency = 1
connect.sidecar_image = envoyproxy/envoy:v${NOMAD_envoy_version}
Job status details
[root@blade1 ~]# nomad status
ID                                     Type     Priority  Status   Submit Date
lightningCollector-lightningCollector  service  50        pending  2020-12-18T15:06:09+01:00
[root@blade1 ~]# nomad status lightningCollector-lightningCollector
ID = lightningCollector-lightningCollector
Name = lightningCollector-lightningCollector
Submit Date = 2020-12-18T15:06:09+01:00
Type = service
Priority = 50
Datacenters = dc1
Namespace = default
Status = pending
Periodic = false
Parameterized = false
Summary
Task Group                               Queued  Starting  Running  Failed  Complete  Lost
lightningCollector-lightningCollector-0  0       0         0        0       0         0
Allocations
No allocations placed
Thank you for your effort and time!
Regards,
Ivan
I tested your job locally and was able to reproduce your experience. I noticed that ParentID was set in the job, which is used by Nomad to track child instances of periodic or dispatch jobs.
After setting the ParentID value to "", I was able to submit the job, and it evaluated and scheduled properly.
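Since you're submitting through nomad-sdk, clearing the field before registering would look roughly like this. This is only a sketch: it assumes the SDK's generated Job model exposes a ParentId setter alongside its other field setters, and JobSubmitter is a hypothetical wrapper around your existing submission code.

```java
import com.hashicorp.nomad.apimodel.Job;
import com.hashicorp.nomad.javasdk.NomadApiClient;

final class JobSubmitter {
    // Registers a job after clearing ParentID. ParentID is meant to be
    // populated by Nomad itself for child instances of periodic or
    // dispatch jobs, so user-submitted jobs should leave it empty.
    static void submit(NomadApiClient client, Job job) throws Exception {
        job.setParentId(""); // assumption: setter name mirrors the ParentID field
        client.getJobsApi().register(job);
    }
}
```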
I did some testing across versions and determined the behavior changed between 0.12.0 and 0.12.1. I filed hashicorp/nomad#10422 in response to this difference in behavior.
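More generally, when a job sits in "pending" with no allocations, the scheduler's reasoning is usually recorded on the job's evaluations, which you can inspect from the CLI (the evaluation ID below is a placeholder):

```shell
# List the evaluations Nomad created for the job
nomad job status -evals lightningCollector-lightningCollector

# Show scheduling details, including placement failures, for one evaluation
nomad eval status -verbose <eval-id>

# Stream agent logs while re-submitting the job
nomad monitor -log-level=DEBUG
```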