Tags: python · hadoop · hdfs · airflow · snakebite

Configure SnakeBite HDFS clients to work with high availability mode


I'm using the snakebite library to access HDFS from my airflow dags.

My HDFS cluster has been upgraded to High Availability Mode. This means that clients configured to point at a single namenode will fail whenever that namenode is not the active one.

What strategies can I use to make my clients cope with high availability mode? Can I configure snakebite clients to fail over to the other namenode? Can I use some kind of load balancer to direct traffic to the active namenode?


Solution

  • It turns out that snakebite has not one but two solutions to this problem: AutoConfigClient, which reads its settings from the Hadoop configuration files, and HAClient, which takes two namenodes explicitly.
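    A minimal sketch of both approaches, assuming a cluster with two namenodes at the hypothetical hosts nn1.example.com and nn2.example.com (substitute your own; HAClient will fail over between the namenodes you list):

    ```python
    from snakebite.client import AutoConfigClient, HAClient
    from snakebite.namenode import Namenode

    # Option 1: read namenodes from the Hadoop config files
    # (uses $HADOOP_HOME / $HADOOP_CONF_DIR to find hdfs-site.xml etc.)
    auto_client = AutoConfigClient()

    # Option 2: pass both namenodes explicitly; snakebite retries
    # against the second namenode if the first is in standby
    ha_client = HAClient([
        Namenode("nn1.example.com", 8020),
        Namenode("nn2.example.com", 8020),
    ])

    # Either client is then used like a normal snakebite Client
    for entry in ha_client.ls(["/"]):
        print(entry["path"])
    ```

    AutoConfigClient is the less brittle option if the machine already has a valid Hadoop client configuration, since the namenode list stays in one place.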

    In my case, I was actually using snakebite through airflow. It turns out that airflow's HDFSHook is smart enough to cope with two namenodes being provided in one connection, and will then use the HAClient.
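    As a sketch of the Airflow side, assuming the CLI-based connection setup and a connection id of hdfs_default (the hook's default; hosts and ports are hypothetical), the HDFSHook can be pointed at both namenodes by registering a connection per namenode under the same conn_id:

    ```shell
    # One connection entry per namenode, sharing the same conn_id;
    # the HDFSHook sees multiple hosts and builds an HAClient from them
    airflow connections add hdfs_default --conn-type hdfs \
        --conn-host nn1.example.com --conn-port 8020
    airflow connections add hdfs_default --conn-type hdfs \
        --conn-host nn2.example.com --conn-port 8020
    ```

    Exact CLI flags vary between Airflow versions, so check `airflow connections add --help` for yours; the key point is that the hook receives both namenode hosts for one connection id.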