Tags: java, multithreading, systemd, scheduledexecutorservice, journal

ScheduledExecutorService slips for 10 minutes on a 1-minute schedule (systemd-journald fault)


I have an executor service that is supposed to write some things to disk every minute.

It is scheduled like this:

    scheduledCacheDump = new ScheduledThreadPoolExecutor(1);
    scheduledCacheDump.scheduleAtFixedRate(this::saveCachedRecords,
                                           60,
                                           60,
                                           TimeUnit.SECONDS
    );
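As an aside (a generic demo, not the service above): scheduleAtFixedRate never runs a task concurrently with itself, so a run that overlasts the period pushes later runs back rather than stacking up. That is one legitimate way a fixed-rate interval can stretch, and it's worth ruling out. A minimal sketch with hypothetical timings:

```java
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Demo of scheduleAtFixedRate semantics: a task that overruns its period
// does not overlap itself; later runs simply start late.
public class FixedRateDemo {
    // Runs a task lasting taskMs on a periodMs schedule for roughly
    // totalMs, and returns how many runs actually started.
    static int runCount(long periodMs, long taskMs, long totalMs) throws InterruptedException {
        ScheduledThreadPoolExecutor ses = new ScheduledThreadPoolExecutor(1);
        AtomicInteger runs = new AtomicInteger();
        ses.scheduleAtFixedRate(() -> {
            runs.incrementAndGet();
            try {
                Thread.sleep(taskMs); // simulate a slow save-to-disk
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, 0, periodMs, TimeUnit.MILLISECONDS);
        Thread.sleep(totalMs);
        ses.shutdownNow();
        return runs.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // A 300 ms task on a 100 ms period fires back to back,
        // nowhere near 10 times per second.
        System.out.println("runs in ~1 s: " + runCount(100, 300, 1000));
    }
}
```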

The task uses a shared list that is filled by the main thread, so it is synchronized on that list:

    private void saveCachedRecords() {
        LOG.info(String.format("Scheduled record dump to disk. We have %d records to save.", recordCache.size()));
        synchronized (recordCache) {
            Iterator<Record> iterator = recordCache.iterator();
            while (iterator.hasNext()) {
               // save record to disk
               iterator.remove();
            }
        }
    }

My list is declared as follows:

    private final List<Record> recordCache = new ArrayList<>();

The main thread receives data by batch, so every second or so, it receives 30 records that it caches in the list. The rest of the time it is waiting on a socket.
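For that handoff to be safe, the batch-receiving thread must take the same lock as the dump task; a plain ArrayList can be corrupted by concurrent access otherwise. A sketch of both sides (String stands in for the Record type, which isn't shown here):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Sketch of the cache shared between the socket thread and the
// scheduler thread, both synchronizing on the list itself.
public class RecordCacheSketch {
    private final List<String> recordCache = new ArrayList<>();

    // Called from the socket thread for each ~30-record batch.
    public void cacheBatch(List<String> batch) {
        synchronized (recordCache) {
            recordCache.addAll(batch);
        }
    }

    // Called from the scheduler thread, mirroring saveCachedRecords().
    public int dump() {
        synchronized (recordCache) {
            int saved = 0;
            Iterator<String> it = recordCache.iterator();
            while (it.hasNext()) {
                it.next();   // save record to disk here
                it.remove();
                saved++;
            }
            return saved;
        }
    }
}
```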

What I don't understand is that, judging from the logs, my scheduled task often fires with far more than 1 minute in between:

    sept. 16 09:30:43 Scheduled record dump to disk. We have 27 records to save.
    sept. 16 09:31:43 Scheduled record dump to disk. We have 27 records to save.
    sept. 16 09:32:43 Scheduled record dump to disk. We have 27 records to save.
    sept. 16 09:33:43 Scheduled record dump to disk. We have 27 records to save.
    sept. 16 09:34:43 Scheduled record dump to disk. We have 27 records to save.
    sept. 16 09:35:43 Scheduled record dump to disk. We have 27 records to save.
    sept. 16 09:42:43 Scheduled record dump to disk. We have 27 records to save.
    sept. 16 09:43:43 Scheduled record dump to disk. We have 27 records to save.
    sept. 16 09:44:43 Scheduled record dump to disk. We have 27 records to save.
    sept. 16 09:45:43 Scheduled record dump to disk. We have 27 records to save.
    sept. 16 09:46:43 Scheduled record dump to disk. We have 27 records to save.
    sept. 16 09:55:43 Scheduled record dump to disk. We have 27 records to save.
    sept. 16 09:56:43 Scheduled record dump to disk. We have 27 records to save.
    sept. 16 09:57:43 Scheduled record dump to disk. We have 27 records to save.
    sept. 16 09:58:43 Scheduled record dump to disk. We have 27 records to save.
    sept. 16 09:59:43 Scheduled record dump to disk. We have 27 records to save.
    sept. 16 10:04:43 Scheduled record dump to disk. We have 27 records to save.
    sept. 16 10:05:43 Scheduled record dump to disk. We have 27 records to save.
    sept. 16 10:06:43 Scheduled record dump to disk. We have 27 records to save.

Look at this:

  • sept. 16 09:59:43 Scheduled record dump to disk. We have 27 records to save.
  • sept. 16 10:04:43 Scheduled record dump to disk. We have 27 records to save.

=> 5 minutes

Or even:

  • sept. 16 09:46:43 Scheduled record dump to disk. We have 27 records to save.
  • sept. 16 09:55:43 Scheduled record dump to disk. We have 27 records to save.

=> 9 minutes

My log line is inside the synchronized block, so I can't tell whether the task is actually scheduled on time and then waits on the lock, or whether it is a genuine scheduling issue. I will move the log line out of the block, but either way I can't fathom how a thread could stay blocked for 10 minutes on a lock that is released about every second.

How can I investigate this?
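One way to narrow this down (a sketch, with a hypothetical SlipProbe name): timestamp the task once before touching the lock and again after acquiring it. Long gaps between successive start times mean the scheduler fired late (or the logging is at fault); a large start-to-acquire gap means the task really waited on the monitor. A jstack thread dump taken while a gap is open would likewise show whether the worker thread is BLOCKED.

```java
// Instrumentation sketch: distinguish a late-firing scheduler from
// contention on the cache's monitor.
public class SlipProbe {
    private volatile long lastStartNanos = -1;
    public volatile long msSincePreviousRun = -1;
    public volatile long msWaitingForLock = -1;

    public void saveCachedRecords(Object lock) {
        long start = System.nanoTime();          // before touching the lock
        if (lastStartNanos >= 0) {
            msSincePreviousRun = (start - lastStartNanos) / 1_000_000;
        }
        lastStartNanos = start;
        synchronized (lock) {
            msWaitingForLock = (System.nanoTime() - start) / 1_000_000;
            // ... save and clear the cache as before ...
        }
    }
}
```

Logging both numbers each run (outside any contended section) makes the two failure modes immediately distinguishable in the journal.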

For information: the machine it runs on is a KVM guest; could this be a factor?


Solution

  • Oh my..

    This was not Java's fault at all. This was systemd's fault.

    My process runs as a systemd service, so the logs I pull come from systemd-journald. And guess what: journald has built-in rate limiting. My daemon would hit it whenever the scheduler fired, so I got PLENTY of lines like this:

    Suppressed 570 messages from /system.slice/xxx.service
    Suppressed 769 messages from /system.slice/xxx.service
    Suppressed 745 messages from /system.slice/xxx.service
    Suppressed 729 messages from /system.slice/xxx.service
    Suppressed 717 messages from /system.slice/xxx.service
    Suppressed 95 messages from /system.slice/xxx.service
    Suppressed 543 messages from /system.slice/xxx.service

    So.. yeah, I removed the throttling on journald and I have my traces every minute now.

    The solution was:

    • edit /etc/systemd/journald.conf
    • add the line RateLimitInterval=0 at the end (renamed RateLimitIntervalSec in newer systemd versions)
    • execute systemctl restart systemd-journald
    • execute systemctl restart myservice
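    The steps above can be sketched as a script. This is a dry run against a throwaway copy; on the real machine point CONF at /etc/systemd/journald.conf and run as root (the systemctl lines are commented out here for the same reason):

```shell
# Dry run of the journald rate-limit fix against a temporary file.
CONF=$(mktemp)
printf '[Journal]\n' > "$CONF"   # journald.conf starts with this section header

# Append the override only if no RateLimitInterval line exists yet.
grep -q '^RateLimitInterval=' "$CONF" || echo 'RateLimitInterval=0' >> "$CONF"
cat "$CONF"

# Then apply for real:
# systemctl restart systemd-journald
# systemctl restart myservice
```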

    Now I have all things going smoothly.

    I'll update the title for future reference :)