mapreduce hadoop-yarn hadoop2 resourcemanager

ResourceManager does not transition from standby to active state

One Spark job had been running for more than 23 days and eventually caused the ResourceManager to crash. After we restarted the ResourceManager instances (there are two of them in our cluster), both of them stayed in standby state.

And we are getting this error:

ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager Failed to load/recover state org.apache.hadoop.yarn.exceptions.YarnException: Application with id application_1470300000724_40101 is already present! Cannot add a duplicate!
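For reference, the offending application ID can be pulled out of such a log line with standard text tools. This is only a rough sketch: `duplicate_app_ids` is a helper name made up for illustration, and the sample line is the error quoted above rather than a real log file.

```shell
#!/usr/bin/env bash
# Sample input: the error from this question. In practice the lines would be
# piped in from the ResourceManager log file instead.
log_line='ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager Failed to load/recover state org.apache.hadoop.yarn.exceptions.YarnException: Application with id application_1470300000724_40101 is already present! Cannot add a duplicate!'

# Hypothetical helper: read log lines on stdin and print the distinct
# application IDs mentioned in "already present" recovery errors.
duplicate_app_ids() {
  grep 'is already present' | grep -oE 'application_[0-9]+_[0-9]+' | sort -u
}

printf '%s\n' "$log_line" | duplicate_app_ids
```

If more than one ID shows up, the state store likely holds several conflicting entries, not just the one from the long-running job.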

We could not kill 'application_1470300000724_40101' through YARN because the ResourceManager is not working. So we killed all of its processes at the Unix level on all nodes, but that didn't work. We have also tried rebooting all nodes, with the same result.

An entry for that job still exists somewhere and is preventing the ResourceManager from being elected as active. We are using Cloudera 5.3.0, and I can see that this issue has been addressed and resolved in Cloudera 5.3.3. But for the moment we need a workaround.

Solution

To resolve this issue, we can format the RMStateStore by executing the command below. It should only be run while the ResourceManager is not running:

yarn resourcemanager -format-state-store

But be careful: this will clear the history of all applications that were executed before running this command.
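The whole procedure might look roughly like the sketch below. It assumes both ResourceManagers have already been stopped (for example from Cloudera Manager) and that the HA IDs are `rm1` and `rm2`; those IDs are placeholders for whatever `yarn.resourcemanager.ha.rm-ids` is set to in your `yarn-site.xml`. Because the format step is destructive, the script only prints the commands unless `DRY_RUN=0` is set. The `run`, `format_rm_state_store` and `check_rm_states` helpers are names invented for this example.

```shell
#!/usr/bin/env bash
# Assumed HA IDs; replace with the values from yarn.resourcemanager.ha.rm-ids.
RM_IDS="${RM_IDS:-rm1 rm2}"
# Dry run by default: commands are printed, not executed. Set DRY_RUN=0 to run them.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Step 1: with every ResourceManager stopped, wipe the RMStateStore.
format_rm_state_store() {
  run yarn resourcemanager -format-state-store
}

# Step 2: after starting the ResourceManagers again, query the HA state of each.
# One is expected to report "active" and the other "standby".
check_rm_states() {
  for id in $RM_IDS; do
    run yarn rmadmin -getServiceState "$id"
  done
}

format_rm_state_store
check_rm_states
```

In a real run the ResourceManagers have to be started again between the two steps, so it is probably easier to call the two functions separately than to run the script end to end.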