Hazelcast OperationTimeoutException on remote execution

I'm running a 5-node Hazelcast cluster at version 3.6.6 in AWS. I'm using it as a workload distributor and I'm using

IExecutorService
<T> void submit(Runnable  task,
              MemberSelector memberSelector,
              ExecutionCallback<T> callback)

API to execute tasks on a member of my choice. I do not use partition-based balancing since different partitions would have different weights.

After I start the cluster it works well for several days and then submitting members start to receive OperationTimeoutException. Once it starts, all members start to receive this timeout and it happens quite sporadically, there maybe a short period when all works smooth and then this exception starts happening again. The target member does receive the task within less than a second and executes it correctly. The exception itself looks like this:

July 3rd 2019, 10:54:01 UTC:

No response for 560000 ms. Aborting invocation! Invocation{serviceName='hz:impl:executorService', op=com.hazelcast.executor.impl.operations.MemberCallableTaskOperation{identityHash=1179024466, serviceName='hz:impl:executorService', partitionId=-1, replicaIndex=0, callId=684145, invocationTime=1562150679963 (Wed Jul 03 10:44:39 UTC 2019), waitTimeout=-1, callTimeout=500000, name=exec_service_3}, partitionId=-1, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=500000, target=Address[x.x.x.x]:5701, backupsExpected=0, backupsCompleted=0, connection=Connection [/x.x.x.x:5701 -> /x.x.x.x:35360], endpoint=Address[x.x.x.x]:5701, alive=true, type=MEMBER} No response has been received! backups-expected:0 backups-completed: 0, execution took: 3445 milliseconds

Stacktrace:

at com.hazelcast.spi.impl.operationservice.impl.Invocation.newOperationTimeoutException(Invocation.java:536) ~[anodot-arnorld-1.0-SNAPSHOT.jar:na]
    at com.hazelcast.spi.impl.operationservice.impl.IsStillRunningService$IsOperationStillRunningCallback.setOperationTimeout(IsStillRunningService.java:241)
    at com.hazelcast.spi.impl.operationservice.impl.IsStillRunningService$IsOperationStillRunningCallback.onResponse(IsStillRunningService.java:229)
    at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture$1.run(InvocationFuture.java:127)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_121]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
    at com.hazelcast.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:76) [anodot-arnorld-1.0-SNAPSHOT.jar:na]
    at com.hazelcast.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:92)
    at ------ End remote and begin local stack-trace ------.(Unknown Source) ~[na:na]
    ... 8 frames truncated

The timing of the exception is quite weird:

July 3rd 2019, 10:53:57.956 - task submitted for execution by sending instance
July 3rd 2019, 10:53:58.024 - execution starts on target instance
July 3rd 2019, 10:54:01.391 - the sending instance receives the exception

In my logs I see that the timeout happens shortly after the task was submitted and the "execution took:" part of the exception is quite precise, indeed in the quoted case the time that passed since the task was sent to execution was about 3.5 seconds. On the other hand, the "invocationTime", (Wed Jul 03 10:44:39 UTC 2019) in quoted case, is about 10 minutes back in the past, even before the job was actually submitted for execution (July 3rd 2019, 10:53:57 UTC)

I've seen this exception being attributed to long GC pauses but as I'm constantly monitoring GC, I'm quite sure this is not the case. Also, the networking between cluster members looks quite live, latencies are low.

From what I've seen in the Hazelcast code, the "invocationTime" is taken from the "clusterClock" and not directly from the system time, suggesting that for some reason the cluster clock is 10 minutes off but I can't figure out why that happens. The cluster is quite busy but I don't see any exceptional surges in load when this exception starts to happen. The problem disappears when I take down the whole cluster and then start it anew. I'm planning to add monitoring on the clusterTime to see when it starts to drift but it'll still won't explain why that happens. Any thoughts?

Update: In short, the cluster time drifts over time from the system time and once the gap is big enough the tasks start to fail with the timeout exception. For details: https://github.com/hazelcast/hazelcast/issues/15339

Solution

Finally, the issue was resolved by upgrading Hazelcast version to 3.12.11 (4.x.x breaks too many things), looks like the way cluster time is managed there is insensitive to GC pauses. Some APIs were broken and needed adjustments in code, nothing too serious. A note of warning, 3.6.6 is incompatible with 3.12.11 so a rolling cluster upgrade is impossible. We did a full cluster restart, luckily, it was possible.