Search code examples
apache-flinkflink-streaming

Flink job is interrupted after 10 minutes


I have a flink job with a global window and custom process.
The Process is failed after ~10 minutes on the next error:

java.io.InterruptedIOException

This is my job:

SingleOutputStreamOperator<CustomEntry> result = stream
                .keyBy(r -> r.getId())
                .window(GlobalWindows.create())
                .trigger(new CustomTriggeringFunction())
                .process(new CustomProcessingFunction());

The CustomProcessingFunction is run for a long time (more then 10 minutes), but after 10 minutes, the process is stoped and failed on InterruptedIOException

Is it possible t increase the timeout of flink job?


Solution

  • From Flink's point of view, that's an unreasonably long period of time for a user function to run. What is this window process function doing that takes more than 10 minutes? Perhaps you can restructure this to use the async i/o operator instead, so you aren't completely blocking the pipeline.

    That said, 10 minutes is the default checkpoint timeout interval, and you're preventing checkpoints from being able to complete while this function is running. So you could experiment with increasing execution.checkpointing.timeout. If the job is failing because checkpoints are timing out, that will help. Or you could increase execution.checkpointing.tolerable-failed-checkpoints from its default (0).