I am trying to upgrade our job from flink 1.4.2 to 1.7.1 but I keep running into timeouts after submitting the job. The flink job runs on our hadoop cluster (version 2.7) with Yarn.
I've seen the following behavior:
When the timeout happens I get the following stacktrace:
INFO class java.time.Instant does not contain a getter for field seconds
INFO class com.bol.fin_hdp.cm1.domain.Cm1Transportable does not contain a getter for field globalId
INFO Submitting job 5af931bcef395a78b5af2b97e92dcffe (detached: false).
INFO ------------------------------------------------------------
INFO The program finished with the following exception:
INFO org.apache.flink.client.program.ProgramInvocationException: The main method caused an error.
INFO at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:545)
INFO at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:420)
INFO at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:404)
INFO at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:798)
INFO at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:289)
INFO at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:215)
INFO at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1035)
INFO at org.apache.flink.client.cli.CliFrontend.lambda$main$9(CliFrontend.java:1111)
INFO at java.security.AccessController.doPrivileged(Native Method)
INFO at javax.security.auth.Subject.doAs(Subject.java:422)
INFO at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
INFO at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
INFO at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1111)
INFO Caused by: java.lang.RuntimeException: org.apache.flink.client.program.ProgramInvocationException: Could not retrieve the execution result.
INFO at com.bol.fin_hdp.job.starter.IntervalJobStarter.startJob(IntervalJobStarter.java:43)
INFO at com.bol.fin_hdp.job.starter.IntervalJobStarter.startJobWithConfig(IntervalJobStarter.java:32)
INFO at com.bol.fin_hdp.Main.main(Main.java:8)
INFO at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
INFO at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
INFO at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
INFO at java.lang.reflect.Method.invoke(Method.java:498)
INFO at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:528)
INFO ... 12 more
INFO Caused by: org.apache.flink.client.program.ProgramInvocationException: Could not retrieve the execution result.
INFO at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:258)
INFO at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:464)
INFO at org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:66)
INFO at com.bol.fin_hdp.cm1.job.Job.execute(Job.java:54)
INFO at com.bol.fin_hdp.job.starter.IntervalJobStarter.startJob(IntervalJobStarter.java:41)
INFO ... 19 more
INFO Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.
INFO at org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$8(RestClusterClient.java:371)
INFO at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
INFO at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
INFO at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
INFO at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
INFO at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$5(FutureUtils.java:216)
INFO at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
INFO at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
INFO at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
INFO at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
INFO at org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$1(RestClient.java:301)
INFO at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680)
INFO at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:603)
INFO at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:563)
INFO at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:424)
INFO at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:214)
INFO at org.apache.flink.shaded.netty4.io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
INFO at org.apache.flink.shaded.netty4.io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:120)
INFO at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
INFO at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
INFO at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
INFO at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
INFO at java.lang.Thread.run(Thread.java:748)
INFO Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Number of retries has been exhausted.
INFO at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$5(FutureUtils.java:213)
INFO ... 17 more
INFO Caused by: java.util.concurrent.CompletionException: org.apache.flink.shaded.netty4.io.netty.channel.ConnectTimeoutException: connection timed out: shd-hdp-b-slave-01...
INFO at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
INFO at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
INFO at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:943)
INFO at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)
INFO ... 15 more
INFO Caused by: org.apache.flink.shaded.netty4.io.netty.channel.ConnectTimeoutException: connection timed out: shd-hdp-b-slave-017.example.com/some.ip.address:46500
INFO at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:212)
INFO ... 7 more
What changed in flip-6 that might cause this behavior and how can I fix this?
In our case, there was a firewall between the machine submitting the job and our Hadoop cluster.
In newer Flink versions (1.7 and up) Flink uses REST to submit jobs. The port number for this REST service is random on yarn setups and could not be set.
Flink 1.8.0 introduced a config option to set this to a port or port range using:
rest.bind-port: 55520-55530