apache-flink flink-stateful-functions apache-stateful-functions

Flink Stateful Functions - unreliable behavior in org.apache.flink.statefun.sdk.Context::sendAfter

We developed a timeout-like mechanism with SF (Stateful Functions) in which a TimeoutManagerFunction keeps and manages the state of the current timeouts set by other functions and sends back an expiration message when one of them times out.

To achieve that behavior, our TimeoutManagerFunction sends itself a message after X seconds, where X is the number of seconds that must pass before the timeout expires. The timeout setter function passes X when it first sets the timeout.

Timeout setter functions can also update a timeout's expiration or cancel the timeout by sending the corresponding message to TimeoutManagerFunction, which edits the expiration or cancels the timeout depending on the message received.

We currently have 10-ish different functions interacting with this TimeoutManagerFunction and some of them are setting/updating/canceling a timeout pretty much every second.

So, in the set(), update() and cancel() methods of TimeoutManagerFunction, we basically do something like:

context.sendAfter(Duration.ofMillis(cancelableTimeout.getTimeoutDuration()), context.self(), selfTimeout);

and in the timeout() method of TimeoutManagerFunction we check whether the given timeout has indeed expired (because its expiration date may have been updated in the meantime):

protected void timeout(Context context, SelfTimeout selfTimeout) {
    try {
        // timeouts is the internal state of timeouts in TimeoutManagerFunction
        CancelableTimeout cancelableTimeout = timeouts.get(selfTimeout.getId());
        if (cancelableTimeout != null) {
            if (Instant.now().isAfter(Instant.ofEpochMilli(cancelableTimeout.getExpiresAt()))) {
                // send the timed-out message back to the initial timeout setter function...
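For readers who want to try the stale-message check in isolation, here is a minimal sketch in plain Java with no StateFun dependency. TimeoutRegistry and its methods are hypothetical names made up for this illustration, not part of the SDK or of the code above, and the current time is passed in explicitly so the logic can be exercised deterministically. Note that this sketch treats "now equals expiry" as expired, which differs slightly from the isAfter check above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the timeout state kept by TimeoutManagerFunction.
// It only models the bookkeeping; sending delayed messages is out of scope.
class TimeoutRegistry {
    // timeout id -> absolute expiry in epoch milliseconds
    private final Map<String, Long> expiresAt = new HashMap<>();

    // set() and update() both overwrite the expiry. In the real function a new
    // delayed self-message would be sent for each call as well.
    void setOrUpdate(String id, long nowMillis, long durationMillis) {
        expiresAt.put(id, nowMillis + durationMillis);
    }

    // cancel() drops the entry, so any self-message still in flight is ignored.
    void cancel(String id) {
        expiresAt.remove(id);
    }

    // Called when a delayed self-message arrives. Returns true only if the
    // timeout still exists and its latest expiry has been reached; the entry
    // is removed in that case so it can fire at most once.
    boolean onTimerMessage(String id, long nowMillis) {
        Long expiry = expiresAt.get(id);
        if (expiry == null || nowMillis < expiry) {
            return false; // cancelled, or superseded by a later update
        }
        expiresAt.remove(id);
        return true;
    }

    public static void main(String[] args) {
        TimeoutRegistry registry = new TimeoutRegistry();
        registry.setOrUpdate("t1", 0L, 10_000L);     // expires at 10_000
        registry.setOrUpdate("t1", 5_000L, 10_000L); // updated: expires at 15_000
        System.out.println(registry.onTimerMessage("t1", 10_000L)); // prints false
        System.out.println(registry.onTimerMessage("t1", 16_200L)); // prints true
    }
}
```

The second call in main arrives 1.2 seconds after the expiry and is still handled correctly, which is the property that matters when delayed messages show up late.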

The issue we are experiencing is that if we do something like context.sendAfter(10 seconds, timeout object), it sometimes takes more than 10 seconds to receive that message back. Any ideas on how or why that might be happening would be appreciated.

Our SF version is 2.2.1 and our Flink version is 1.11.6.

Solution

There's no guarantee that the requested timing will be strictly observed, so the delay passed to sendAfter is best treated as a minimum rather than an exact delivery time. If the cluster is under-provisioned relative to the amount of processing being done, it's more likely that some of the timers will be triggered late.
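The effect is easy to reproduce outside of Flink. The sketch below is only an analogy and uses nothing from StateFun: a single-threaded java.util.concurrent scheduler stands in for an overloaded worker, and LateTimerDemo and measureDelayMillis are hypothetical names chosen for this example. A timer is requested for 100 ms, but the only thread is busy with other work for 300 ms, so the timer fires late.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Analogy only: a single-threaded scheduler plays the role of a worker that
// has more processing to do than it can keep up with.
class LateTimerDemo {

    // Schedules a timer for requestedMillis while the only worker thread is
    // kept busy for busyMillis, and returns the delay that was actually observed.
    static long measureDelayMillis(long requestedMillis, long busyMillis) {
        ScheduledExecutorService worker = Executors.newSingleThreadScheduledExecutor();
        AtomicLong firedAtNanos = new AtomicLong();
        CountDownLatch fired = new CountDownLatch(1);
        try {
            long startNanos = System.nanoTime();
            // Other processing that occupies the worker.
            worker.execute(() -> sleepQuietly(busyMillis));
            // The timer cannot run until the worker is free again.
            worker.schedule(() -> {
                firedAtNanos.set(System.nanoTime());
                fired.countDown();
            }, requestedMillis, TimeUnit.MILLISECONDS);
            if (!fired.await(requestedMillis + busyMillis + 5_000, TimeUnit.MILLISECONDS)) {
                throw new IllegalStateException("timer did not fire");
            }
            return TimeUnit.NANOSECONDS.toMillis(firedAtNanos.get() - startNanos);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            worker.shutdownNow();
        }
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("requested 100 ms, observed "
                + measureDelayMillis(100, 300) + " ms");
    }
}
```

The observed delay never drops below the requested one, but it grows with the amount of competing work. Timers in a busy cluster plausibly behave in a similar way, which is why the expiry check in timeout() should tolerate late arrivals rather than assume exact timing.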