hadoop · hdfs · hortonworks-data-platform · high-availability · bigdata

Hadoop HA Namenode goes down with the Error: flush failed for required journal (JournalAndStream(mgr=QJM to [<ip>:8485, <ip>:8485, <ip>:8485]))


The Hadoop NameNode goes down almost once every day with the following fatal error:

    FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(398)) -
    Error: flush failed for required journal (JournalAndStream(mgr=QJM to [<ip>:8485, <ip>:8485, <ip>:8485], stream=QuorumOutputStream starting at txid <>))
    java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
        at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
        at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
        at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
        at ...

Can someone suggest what I should look into to resolve this issue?

I am using VMs for the journal nodes and master nodes. Could that be causing the problem?


Solution

  • In my case, this issue was caused by the system clocks drifting apart between the nodes of the cluster.
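
    To confirm this, compare the clocks across the nodes. A minimal sketch, assuming passwordless SSH and placeholder hostnames (nn1, nn2, jn1, jn2, jn3 stand in for your own machines):

    # Print each node's Unix epoch time; the values should agree to within a second
    for h in nn1 nn2 jn1 jn2 jn3; do
        echo -n "$h: "; ssh "$h" date +%s
    done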

    To keep the system time in sync, we can execute the commands below on each node.

    sudo service ntpd stop
    
    sudo ntpdate pool.ntp.org  # Run a few times, until the reported offset is near zero
    
    sudo service ntpd start
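
    After ntpd is back up, you can verify that it is synchronized (a sketch; ntpq ships with the ntp package):

    # An asterisk in the first column marks the peer the daemon is synced to
    ntpq -p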
    

    If Hue is down, run the command below on the Hue server machine:

    sudo service hue start
    

    If the NameNode is down, start the NameNode.
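
    How you start it depends on the setup: on an Ambari-managed HDP cluster, restart the NameNode from the Ambari UI; on a plain Apache Hadoop 2.x install, something like the following works (a sketch; assumes HADOOP_HOME is set and the command is run as the hdfs user):

    # Start the NameNode daemon on the affected master
    $HADOOP_HOME/sbin/hadoop-daemon.sh start namenode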

    Recurring fix

    Add a crontab entry for the root user on all the nodes of the environment, so that the clocks are re-synced periodically.
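
    A minimal sketch of such an entry (assumes ntpdate is installed at /usr/sbin/ntpdate; the -u flag uses an unprivileged source port so the sync also works while ntpd is running):

    # Add via: sudo crontab -e
    # Re-sync the clock at the top of every hour
    0 * * * * /usr/sbin/ntpdate -u pool.ntp.org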

    or

    Install VM tools (for example, open-vm-tools for VMware guests) to keep the guest clock in sync with the hypervisor host.
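
    For example, on a CentOS/RHEL guest running under VMware (an assumption; adjust the package and commands for your hypervisor and distro):

    # Install the open-source VMware guest tools
    sudo yum install -y open-vm-tools

    # Enable periodic guest-to-host time synchronization
    vmware-toolbox-cmd timesync enable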