apache-flink, batch-processing

Why does a Flink batch job run print() and count() as separate jobs?


I'm writing a Flink batch job and added many print() calls to my DataSet<> for debugging. I then deployed the job on k8s in job cluster mode, where the job manager is a k8s Job, and it stopped as soon as the first print() finished.

Finally, I ran the program locally with ExecutionEnvironment.createLocalEnvironmentWithWebUI(config) and found that Flink executes several jobs one after another, each with a different job ID. These jobs are sub-jobs of my full job.

If so, why is Flink designed this way? Do I need to remove all print() calls in the production environment?


Solution

  • As per the DataSet.print documentation:

    This method immediately triggers the program execution, similar to the collect() and count() methods

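    So no, you can't sprinkle print() statements throughout your workflow: each print(), collect(), or count() call runs the plan defined up to that point as its own job. In job cluster mode, where the cluster is dedicated to a single job, that is presumably why execution stopped after the first print(). Instead, you can create a FilterFunction that (a) never filters anything out, and (b) uses logging statements to record the data passed to it. Be careful to do this only when processing small amounts of data, as otherwise the logging output can fill up a node's disk.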
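
    A minimal sketch of the pass-through FilterFunction described above, assuming the DataSet API and SLF4J are on the classpath. The names DebugLoggingExample and LoggingPassThrough, the labels, and the sample data are hypothetical, chosen only for illustration. The whole pipeline runs as one job because nothing calls print(), collect(), or count().

```java
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.io.DiscardingOutputFormat;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class DebugLoggingExample {

    /** Hypothetical helper: logs every record and never drops one. */
    public static class LoggingPassThrough<T> implements FilterFunction<T> {
        private static final Logger LOG = LoggerFactory.getLogger(LoggingPassThrough.class);

        // Label that identifies the logging point in the output.
        private final String label;

        public LoggingPassThrough(String label) {
            this.label = label;
        }

        @Override
        public boolean filter(T value) {
            LOG.info("[{}] {}", label, value);
            return true; // keep every record
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Sample input; a real job would read from its actual source.
        DataSet<String> words = env.fromElements("a", "b", "c")
                .filter(new LoggingPassThrough<>("after-source"));

        DataSet<String> upper = words
                .map(new MapFunction<String, String>() {
                    @Override
                    public String map(String value) {
                        return value.toUpperCase();
                    }
                })
                .filter(new LoggingPassThrough<>("after-map"));

        // A sink is still required; this one discards the records.
        upper.output(new DiscardingOutputFormat<>());

        // The single execute() call submits the whole plan as one job.
        env.execute("debug-logging-example");
    }
}
```

    Unlike print(), the log lines are written by the tasks that process the records, so on a cluster they should show up in the task manager logs rather than on the client's standard output.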