What is stalling my systemd user timer?

I use systemd user timers as a cron replacement. I have a particular program set to execute every 20 minutes. The program is not a daemon, is network-dependent, and launches a number of child processes. I've noticed however that the timer frequently stalls after a few hours (or days). The timer is still active, yet the program is no longer executed every 20 minutes. pgrep shows a number of processes still active. After observing this, I added JobTimeoutSec=3m to the .service file with the expectation that the processes would be killed if they timed out.

systemctl status --user PROGRAM.service now outputs the following. However, the child processes are still running, and the timer no longer executes the program every 20 minutes:

Feb 13 15:03:45 HOSTNAME systemd[1878]: Job PROGRAM.service/start timed out.

Feb 13 15:03:45 HOSTNAME systemd[1878]: Timed out starting DESCRIPTION.

Feb 13 15:03:45 HOSTNAME systemd[1878]: Job PROGRAM.service/start failed with result 'timeout'.

I'd guess that the program's child processes are stalling due to network difficulties and systemd fails to kill them upon timeout.

Any suggestions for resolving this so that the timer continues as expected?

Replacing ExecStart=/path/to/program with ExecStart=/usr/bin/timeout 20m /path/to/program appears to solve this, but I'd like to find out why systemd alone does not.

Debugging Information

PROGRAM.service

[Unit]
Description=DESCRIPTION
After=network.target
PartOf=network-online.target
JobTimeoutSec=3m

[Service]
Type=oneshot
ExecStart=/path/to/program

[Install]
WantedBy=network-online.target

PROGRAM.timer

[Unit]
Description=Run PROGRAM.service every 20 minutes

[Timer]
OnCalendar=*:0/20

[Install]
WantedBy=timers.target

systemd --version outputs the following:

systemd 219

+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT -GNUTLS +ACL +XZ -LZ4 -SECCOMP +BLKID -ELFUTILS +KMOD -IDN

Solution

Active Process

There are two important things in systemd which I think you are hitting in this case:

When you start a process with systemd, all of its child processes are (at least by default) part of the same control group (cgroup) as the service.
As long as any one of those children has not died, systemd considers the service to be still (at least somewhat) running.

What does that mean?

The timer description says:

Note that in case the unit to activate is already active at the time the timer elapses it is not restarted, but simply left running.

In other words, if any one of your processes is still running 20 minutes later, the timer system will not restart anything.

Why does this make sense?!

Starting another copy while the previous one is still running would just fill up memory and possibly break many other things. Classic cron offers no such protection: it has no concept of a control group and does not track what it has launched, so it starts the job again at every scheduled time, whether or not the previous run (or any of its children) has finished.

What is the systemd solution?

Assuming you cannot simply make the child processes exit (although, since you used /usr/bin/timeout, you probably can), one option is KillMode, although I do not recommend it:

KillMode=process

This means that once the main process has died, the service is considered stopped. The documentation describes the setting as follows:

If set to process, only the main process itself is killed.

You may want to test whether that really works: the documentation only says which process is killed, not that the whole group is then considered dead. In my experience, though, it works.
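
As a minimal sketch of where this setting would live, assuming the PROGRAM.service from the question: KillMode= is a [Service] option, so it sits next to ExecStart. Whether lingering children then stop blocking the timer is something to verify on your own systemd version.

```ini
[Service]
Type=oneshot
ExecStart=/path/to/program
# Only the main process is signalled when the unit is stopped;
# child processes are left alone and may linger indefinitely.
KillMode=process
```

After editing the unit, run systemctl --user daemon-reload so the user manager picks up the change.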

What is a better solution then?

Since I do not recommend KillMode, there should be another solution. The fact is that each of your processes has 20 minutes to run (or whatever time remains in the window when it is spawned); otherwise it will prevent the following run from happening. That may be okay once in a while, but certainly not if the processes stay around forever. So the better fix is to edit those programs and make sure they quit after a while.

However, if the processes cannot quit on time by themselves, it may be necessary to kill them after a long while, and the timeout tool you are already using could be the best solution. I would suggest one small modification: use 19 minutes for the timeout, because otherwise you may miss the next startup window.

ExecStart=/usr/bin/timeout 19m /path/to/program
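
For comparison, here is a systemd-only sketch of the same 19-minute cap. It rests on how the systemd.unit and systemd.service man pages describe the two options: JobTimeoutSec= only aborts the queued job and has no effect on the unit itself, which would explain the "Job PROGRAM.service/start timed out" log with the processes still alive, while TimeoutStartSec= in [Service] applies to the service itself and is merely disabled by default for Type=oneshot. This is untested on systemd 219, so treat it as an assumption to check.

```ini
[Service]
Type=oneshot
ExecStart=/path/to/program
# Assumption: an explicit start timeout makes systemd fail the start and
# stop the unit, terminating the remaining processes according to
# KillMode=, if the program has not finished after 19 minutes.
TimeoutStartSec=19m
```

With this in place, the JobTimeoutSec=3m line in [Unit] should no longer be needed, and KillMode= can stay at its default so that the children are cleaned up along with the main process.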