I'm creating an EMR cluster in a private subnet and am currently struggling to get the cluster to create properly.
I have NAT Gateways in all of my public subnets, and my private subnet route tables all have a route to the NAT Gateway in their AZ. For the EMR cluster configuration, I currently leave everything optional blank to get the simplest starting configuration.
I am using a VPC with two private subnets and one public subnet, and the AWS-created security groups for primary/core/task. My instance sizes are m1.small. The behavior I am observing is as follows:
The instance creation process starts, then hangs for around 1 hour before finally failing with this cryptic error:
On the master instance (i-01191fd75d02d1257), application provisioning failed
I'm not sure what this indicates beyond the obvious, which is that it failed to provision. I don't want to start a job; I just want to get the primary/core nodes up and running, and this error does not give me much to work with in finding the root issue.
I see the following error in the instance-controller.log file:
AppPoller-Bg-Thread-2: Delay for 2 seconds before retry attempt 1/5 on http://10.0.152.217:8088/ws/v1/cluster/nodes
java.net.ConnectException: Connection refused (Connection refused)
But I'm not sure what this means and can't find information about it on Google.
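For context, port 8088 is the default port of the YARN ResourceManager web/REST interface, and /ws/v1/cluster/nodes is one of its REST endpoints, so "Connection refused" suggests nothing was listening there when the instance controller polled. A minimal sketch for checking this by hand, assuming it is run from a host that can route to the address in the log (for example the primary node itself); the helper name `port_open` is made up for illustration:

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


# Example call, using the address and port from the log line above:
#   port_open("10.0.152.217", 8088)
```

If this returns False on the primary node itself, the ResourceManager most likely never started, which points at provisioning on the node rather than at routing between nodes.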
The bootstrap/master.log file has the following:
2023-06-27 12:26:24,662 INFO i-01191fd75d02d1257: new instance started
2023-06-27 12:26:24,918 ERROR i-01191fd75d02d1257: failed to start. bootstrap action 1 failed with non-zero exit code.
2023-06-27 12:27:15,008 INFO i-093e926ba8e684d7c: new instance started
2023-06-27 12:27:20,833 INFO i-093e926ba8e684d7c: all bootstrap actions complete and instance ready
which would indicate that it has finished bootstrapping, but the EMR console still shows this primary node as Bootstrapping, so there seems to be a disconnect here.
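One thing worth checking in the bootstrap log excerpt is which instance each line refers to: as posted, the ERROR line and the "all bootstrap actions complete" line carry different instance IDs. A small sketch that groups the lines by instance ID, assuming the log format is exactly as shown above (timestamp, level, instance ID, message); the function names are made up for illustration:

```python
import re
from collections import defaultdict

# Matches lines such as:
# 2023-06-27 12:26:24,662 INFO i-01191fd75d02d1257: new instance started
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<level>[A-Z]+) (?P<instance>i-[0-9a-f]+): (?P<msg>.*)$"
)


def group_by_instance(lines):
    """Map each instance ID to its list of (level, message) events, in log order."""
    events = defaultdict(list)
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:  # silently skip lines that do not fit the assumed format
            events[m["instance"]].append((m["level"], m["msg"]))
    return dict(events)


def failed_instances(events):
    """Return the instance IDs that logged at least one ERROR line."""
    return [iid for iid, evs in events.items()
            if any(level == "ERROR" for level, _ in evs)]


# The four lines from bootstrap/master.log quoted above.
SAMPLE = [
    "2023-06-27 12:26:24,662 INFO i-01191fd75d02d1257: new instance started",
    "2023-06-27 12:26:24,918 ERROR i-01191fd75d02d1257: failed to start. "
    "bootstrap action 1 failed with non-zero exit code.",
    "2023-06-27 12:27:15,008 INFO i-093e926ba8e684d7c: new instance started",
    "2023-06-27 12:27:20,833 INFO i-093e926ba8e684d7c: all bootstrap actions complete and instance ready",
]
```

Calling `failed_instances(group_by_instance(SAMPLE))` on the excerpt lists only i-01191fd75d02d1257, the same ID named in the "application provisioning failed" message.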
I see this error in the /emr/instance-controller/log/hadoop-commands/ directory:
report: FileSystem file:/// is not an HDFS file system. The fs class is: org.apache.hadoop.fs.LocalFileSystem
Usage: hdfs dfsadmin [-report] [-live] [-dead] [-decommissioning] [-enteringmaintenance] [-inmaintenance]
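The "file:/// is not an HDFS file system" message is what hdfs dfsadmin prints when fs.defaultFS still has Hadoop's built-in default of file:///, which would be consistent with the Hadoop configuration not having been written yet. A rough sketch for reading that property out of core-site.xml; the file location /etc/hadoop/conf/core-site.xml and the sample host name are assumptions, not something taken from the logs above:

```python
import xml.etree.ElementTree as ET


def hadoop_property(xml_text, name, default=None):
    """Return the value of a Hadoop <property> by name, or `default` if absent.

    `xml_text` may be str or bytes holding the contents of a *-site.xml file.
    """
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return default


# Hypothetical file contents for illustration; "example-host" is a placeholder.
SAMPLE = """<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://example-host:8020</value>
  </property>
</configuration>"""

# On a node, assuming the config lives at /etc/hadoop/conf/core-site.xml:
#   with open("/etc/hadoop/conf/core-site.xml", "rb") as f:
#       print(hadoop_property(f.read(), "fs.defaultFS", "file:///"))
```

A result of file:/// (or a missing file) would mean the node was never configured to use HDFS, which matches the dfsadmin error.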
The EMR instance state log has the following message:
Sleeping for a random period of time up to 10 minutes
Try m6a.xlarge (3rd-generation AMD EPYC processors) if you need software or libraries that require the x86 architecture, or m7g.xlarge (AWS Graviton3 processors) otherwise, instead of the M1 type, which is a previous generation.