How do I understand why my Flink TaskManager quits shortly after starting my job?

I'm using Flink 1.15 Docker images in Session mode pretty much the same as the Compose documentation. I have one Task Manager. A few minutes after starting my streaming job I get a stack dump log message from my Job Manager stating that the Task Manager is no longer reachable and I see that my Task Manager Docker container has exited with code 137 - which possibly indicates an out of memory error. Although docker inspect shows the OOMKilled flag as false indicating some sort of other issue.

End of stack trace from Job Manager log:

Caused by: org.apache.flink.runtime.jobmaster.JobMasterException: TaskManager with id 172.18.0.5:44333-7c7193 is no longer reachable.

The TaskManager Docker logs produce no error whatsoever before exiting. If I resurrect the dead Task Manager Docker container and have a look at the log file in /opt/flink/logs/ then the last messages state that the various components in my pipeline have switched from INITIALIZING to RUNNING.

I would have expected an out of memory stack dump from the task manager if my state had become too large. Also docker inspect shows that the container did not exit because of an out of memory error.

I have no idea what causes my Task Manager to die. Any ideas how I can figure out what is causing the issue? (This happens on 1.15.1 & 1.15.2. I haven't used any other version of Flink.)

Solution

I ended up using nothing more sophisticated than trial and error with a variety of different test jobs. I'm not 100% sure I fixed the problem as the issue of the Task Manager crashing without an stack dump occurred sporadically. However the Task Manager hasn't crashed for several days.

The simplest job to recreate my issue was with a SourceFunction outputting a continuous stream of incrementing Longs straight to a DiscardingSink. With this setup the Task Manager would crash after a while on my Linux machine sporadically but never on my Mac.

If I added a Thread.sleep to the SourceFunctions run loop then the crash would eventually occur but take a bit longer.

I tried Source framework instead of SourceFunction where a SingleThreadMultiplexSourceReaderBase repeatedly calls fetch on a SplitReader to output the Longs. There have been fewer crashes since I did this so it didn't work 100%.

I presume my SourceFunction was overfilling some sort of buffer or making a task slot unresponsive as it never relinquished a slot once it started. (Or some other completely different explanation.)

I wish the Task Manager gave some sort of indication why it stopped running.