
Hortonworks Data Platform: High load causes node restart


I have set up a Hadoop cluster with Hortonworks Data Platform 2.5. I'm using 1 master and 5 slave (worker) nodes.

Every few days, one (or more) of my worker nodes gets a high load and seems to restart the whole CentOS operating system automatically. After the restart, the Hadoop components are no longer running and have to be restarted manually via the Ambari management UI.

Here is a screenshot of the "crashed" node (it rebooted after the high load value about 4 hours ago): [screenshot]

Here is a screenshot of one of the other "healthy" worker nodes (all other workers have similar values): [screenshot]

The crashes alternate between the 5 worker nodes; the master node seems to run without problems.

What could cause this problem? Where are these high load values coming from?


Solution

  • This seems to be a kernel problem: the crash backtrace (e.g. /var/spool/abrt/vmcore-127.0.0.1-2017-06-26-12:27:34/backtrace) contains lines like

    Version: 3.10.0-327.el7.x86_64
    BUG: unable to handle kernel NULL pointer dereference at 00000000000001a0
    

    After running sudo yum update, the kernel version was

    [root@myhost ~]# uname -r
    3.10.0-514.26.2.el7.x86_64
    

    Since the operating system update, the problem has not occurred again. I will keep observing the issue and give feedback if necessary.
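To repeat this check on the other worker nodes, something like the following sketch might help. It prints the "Version:" and "BUG:" lines from any abrt vmcore backtraces and compares the running kernel with the release reported after the update above. The helper names scan_backtraces and version_lt are made up for illustration, the script assumes GNU sort (for -V), and reading /var/spool/abrt will usually require root.

```shell
#!/bin/sh
# Sketch only: helper names are illustrative, not part of abrt or HDP.

# Print the "Version:" and "BUG:" lines from every abrt vmcore backtrace
# found under the given directory (default: /var/spool/abrt).
scan_backtraces() {
    dir="${1:-/var/spool/abrt}"
    for bt in "$dir"/vmcore-*/backtrace; do
        [ -f "$bt" ] || continue    # glob matched nothing or file unreadable
        echo "== $bt"
        grep -E '^[[:space:]]*(Version:|BUG:)' "$bt" || true
    done
}

# Succeed if version $1 sorts strictly before version $2 (GNU sort -V).
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Kernel release reported after the update in this answer.
updated="3.10.0-514.26.2.el7.x86_64"
running="$(uname -r)"

scan_backtraces /var/spool/abrt

if version_lt "$running" "$updated"; then
    echo "running kernel $running is older than $updated"
else
    echo "running kernel $running is not older than $updated"
fi
```

A node that still reports an older kernel, or that shows the NULL pointer dereference in a recent backtrace, would be a candidate for sudo yum update followed by a reboot. Note that an updated kernel only becomes active after the reboot.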