hadoop, hdfs, mesos, marathon, dcos

HDFS resiliency to machine restarts in DC/OS


I have installed HDFS from Universe on my DC/OS cluster of 10 CoreOS machines (3 master nodes, 7 agent nodes). My HA HDFS configuration has 2 name nodes, 3 journal nodes and 5 data nodes. My question is: shouldn't HDFS be resilient to machine restarts? If I restart a machine that hosts a data node, the data node gets rebuilt as a mirror of the others, but only after I restart the HDFS service from the DC/OS UI. If I restart a machine that hosts a journal node or a name node, the node is just marked as lost and never rebuilt.


Solution

  • The problem was eventually traced to a buggy version of the Universe HDFS package for DC/OS. A completely new HDFS package for DC/OS will be released on Universe in the next few weeks.

    https://dcos-community.slack.com/archives/data-services/p1485717889001709

    https://dcos-community.slack.com/archives/data-services/p1485801481001734
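
As a side note on why the behaviour in the question looks like a bug and not a limit of the layout: the topology described (2 name nodes, 3 journal nodes, 5 data nodes) should in principle survive a single machine restart. The rough sketch below just does that arithmetic. The function names are made up for illustration, and the replication factor of 3 is an assumption (it is the HDFS default for `dfs.replication`; the question does not state the value used). It says nothing about how the DC/OS package reschedules lost tasks.

```python
# Back-of-the-envelope fault tolerance for an HA HDFS layout.
# All function names here are hypothetical helpers, not part of any HDFS or DC/OS API.


def tolerated_journal_failures(journal_nodes: int) -> int:
    """Journal nodes that may be down while a majority (quorum) is still up."""
    if journal_nodes < 1:
        raise ValueError("need at least one journal node")
    majority = journal_nodes // 2 + 1
    return journal_nodes - majority


def tolerated_namenode_failures(name_nodes: int) -> int:
    """Name nodes that may be down while one is still left to serve clients."""
    if name_nodes < 1:
        raise ValueError("need at least one name node")
    return name_nodes - 1


def tolerated_replica_losses(replication: int, data_nodes: int) -> int:
    """Data nodes holding a given block that may be lost at once without losing the block.

    A block cannot have more replicas than there are data nodes.
    """
    if replication < 1 or data_nodes < 1:
        raise ValueError("replication and data_nodes must be positive")
    effective_replicas = min(replication, data_nodes)
    return effective_replicas - 1


if __name__ == "__main__":
    # Layout from the question; replication=3 is an assumed default.
    print("journal nodes that can fail:", tolerated_journal_failures(3))
    print("name nodes that can fail:", tolerated_namenode_failures(2))
    print("replica holders that can fail:", tolerated_replica_losses(3, 5))
```

So with this layout, one restarted journal node or name node should not take HDFS down on its own, which matches the conclusion that the package, not the configuration, was at fault.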