Tags: hadoop, high-availability, federation, quorum

What is the difference between Hadoop NameNode HA and HDFS Federation?


I am a bit confused about Hadoop NameNode HA using QJM and HDFS Federation. Both use multiple NameNodes and both provide high availability. I am not able to decide which architecture to use for NameNode high availability, since both look exactly the same except for the QJM part.

Please pardon me if this is not the type of question to be discussed here.


Solution

  • The main difference between HDFS High Availability and HDFS Federation is that the NameNodes in a federation are independent of each other.

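    In HDFS Federation, the NameNodes share the DataNodes as common storage, but each NameNode manages its own namespace and its own block pool. This isolates failures: if one NameNode in a federation fails, the namespaces served by the other NameNodes are not affected. The failed NameNode's own namespace, however, is unavailable until it comes back.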

    So, Federation = Multiple namenodes and no correlation.

    In HDFS HA, by contrast, there are typically two NameNodes serving the same namespace: an Active NN and a Standby NN. The Active NN does all the work, all the time, while the Standby NN just sits there and chills, keeping its metadata in sync with the Active NameNode by reading the edits it shares (through the JournalNodes, when QJM is used), which is what makes the two related. When the Active NN gets tired of the usual routine (i.e. it fails), the Standby NameNode takes over with the most recent metadata it has.

    For an HA architecture, you need at least two separate machines configured as NameNodes, of which only one should be in the Active state at any time.

    More details here: HDFS High Availability
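
To make the contrast concrete, here is a toy Python model of the two setups. It is only a sketch and not Hadoop code: the class names `NameNode`, `Federation`, and `HAPair` are made up for illustration, the mount prefixes are arbitrary, and a plain Python list stands in for the shared edit log that the JournalNodes would hold under QJM.

```python
class NameNode:
    """Toy NameNode: keeps a set of file paths as its namespace."""

    def __init__(self, name):
        self.name = name
        self.namespace = set()
        self.alive = True

    def create(self, path):
        if not self.alive:
            raise RuntimeError(f"{self.name} is down")
        self.namespace.add(path)


class Federation:
    """Several independent NameNodes, each owning one part of the tree."""

    def __init__(self, mounts):
        # mounts maps a path prefix to the NameNode that owns it
        # (hypothetical layout, e.g. {"/user": nn_a, "/logs": nn_b}).
        self.mounts = mounts

    def create(self, path):
        for prefix, nn in self.mounts.items():
            if path == prefix or path.startswith(prefix + "/"):
                nn.create(path)  # fails only if *this* NameNode is down
                return nn.name
        raise KeyError(f"no NameNode mounted for {path}")


class HAPair:
    """One namespace served by an active NameNode with a standby behind it."""

    def __init__(self, active, standby):
        self.active = active
        self.standby = standby
        self.edit_log = []  # stand-in for the shared edit log

    def failover(self):
        # The standby replays the shared edits, then the roles are swapped.
        self.standby.namespace.update(self.edit_log)
        self.active, self.standby = self.standby, self.active

    def create(self, path):
        if not self.active.alive:
            self.failover()
        self.active.create(path)
        # Record the change in the shared log so the standby can replay it.
        self.edit_log.append(path)
        return self.active.name


if __name__ == "__main__":
    # Federation: losing one NameNode leaves the other namespace usable.
    fed = Federation({"/user": NameNode("nn-user"), "/logs": NameNode("nn-logs")})
    fed.mounts["/user"].alive = False
    print(fed.create("/logs/app.log"))

    # HA: losing the active NameNode hands the same namespace to the standby.
    ha = HAPair(NameNode("nn1"), NameNode("nn2"))
    ha.create("/user/a.txt")
    ha.active.alive = False
    print(ha.create("/user/b.txt"))
```

In the federation case, a write under `/user` would still fail while its NameNode is down, which is why Federation on its own does not give you high availability. In the HA case, the same namespace stays writable because the standby takes over.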