Search code examples
amazon-ec2amazon-emr

Do EC2 system reboot events cause EMR to wait for the node or to replace it?


I have a long running EMR cluster. I received EC2 event notifications of upcoming system reboots. The help document advises that even rebooting these manually will not reschedule this, though stopping and starting the instances might.

The EMR cluster claims if a core node goes unresponsive it will provision a new one. I suspect this provisioning takes longer than a reboot, so what I cannot find in the documentation is whether the EC2 event is known to EMR and the cluster will wait for it's missing core nodes (or task nodes) to reboot and rejoin, or whether EMR will respond as though these instances disappeared un-expectantly, and thus will start provisioning new replacements even as the nodes come back and rejoin the cluster.

Does anyone know which it will be?


Solution

  • It turns out the AWS service person operating the HW replacement and rebooting the instance was tasked with doing the correct adjustments in EMR for changing the instance. They started by adding a node, then draining the old node of tasks. Then they rebooted the node and it was reattached to EMR. Then they drained the added node and shut that down.

    I'm not sure this is happening every time there's a reboot event though. It seems like the script of service steps is modified for different types of case.