hadoop · hdfs · hortonworks-data-platform · high-availability · bigdata

Hadoop HA Namenode goes down with the Error: flush failed for required journal (JournalAndStream(mgr=QJM to [<ip>:8485, <ip>:8485, <ip>:8485]))


The Hadoop NameNode goes down almost once every day with the following fatal error:

    FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(398)) -
    Error: flush failed for required journal (JournalAndStream(mgr=QJM to [<ip>:8485, <ip>:8485, <ip>:8485], stream=QuorumOutputStream starting at txid <>))
    java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
        at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
        at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
        at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
        at ...

Can someone suggest what I should look into to resolve this issue?

I am using VMs for the journal nodes and master nodes. Could that be causing the problem?


Solution

  • In my case, this issue was caused by the system clocks drifting apart between the nodes of the cluster.
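
    To confirm this, compare the clocks across the nodes. A minimal sketch, assuming passwordless SSH and placeholder hostnames (nn1, nn2, jn1, jn2, jn3 stand in for your own machines):

    # Print each node's Unix epoch time; the values should agree to within a second
    for h in nn1 nn2 jn1 jn2 jn3; do
        echo -n "$h: "; ssh "$h" date +%s
    done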

    To keep the system time in sync, we can execute the commands below on each node.

    sudo service ntpd stop
    
    sudo ntpdate pool.ntp.org  # Run a few times, until the reported offset is near zero
    
    sudo service ntpd start
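
    After ntpd is back up, you can verify that it is synchronized (a sketch; ntpq ships with the ntp package):

    # An asterisk in the first column marks the peer the daemon is synced to
    ntpq -p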
    

    If Hue is down, run the command below on the Hue server machine:

    sudo service hue start
    

    If the NameNode is down, start the NameNode.
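
    How you start it depends on the setup: on an Ambari-managed HDP cluster, restart the NameNode from the Ambari UI; on a plain Apache Hadoop 2.x install, something like the following works (a sketch; assumes HADOOP_HOME is set and the command is run as the hdfs user):

    # Start the NameNode daemon on the affected master
    $HADOOP_HOME/sbin/hadoop-daemon.sh start namenode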

    Recurring fix

    Add a crontab entry for the root user on all the nodes of the environment, so that the clocks are re-synced periodically.
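
    A minimal sketch of such an entry (assumes ntpdate is installed at /usr/sbin/ntpdate; the -u flag uses an unprivileged source port so the sync also works while ntpd is running):

    # Add via: sudo crontab -e
    # Re-sync the clock at the top of every hour
    0 * * * * /usr/sbin/ntpdate -u pool.ntp.org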

    or

    Install VM tools (for example, open-vm-tools for VMware guests) to keep the guest clock in sync with the hypervisor host.
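
    For example, on a CentOS/RHEL guest running under VMware (an assumption; adjust the package and commands for your hypervisor and distro):

    # Install the open-source VMware guest tools
    sudo yum install -y open-vm-tools

    # Enable periodic guest-to-host time synchronization
    vmware-toolbox-cmd timesync enable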