apache-spark, pyspark, hbase, apache-zookeeper, thrift

HBase client for Spark cannot be authenticated in ZooKeeper using SASL


I'm processing HBase tables from Spark (on EMR, in YARN mode). I'm actually using PySpark, though I don't think that matters. I call HBase through a separate Thrift service from outside the HBase cluster.

It looks like I was able to connect to the Thrift servers, but I have an issue with ZooKeeper (the error points to ZooKeeper port 2181).

Why does that happen and how can I fix that?

17/08/02 20:21:31 INFO ZooKeeper: Client environment:java.io.tmpdir=/tmp
17/08/02 20:21:31 INFO ZooKeeper: Client environment:java.compiler=<NA>
17/08/02 20:21:31 INFO ZooKeeper: Client environment:os.name=Linux
17/08/02 20:21:31 INFO ZooKeeper: Client environment:os.arch=amd64
17/08/02 20:21:31 INFO ZooKeeper: Client environment:os.version=4.4.35-33.55.amzn1.x86_64
17/08/02 20:21:31 INFO ZooKeeper: Client environment:user.name=hadoop
17/08/02 20:21:31 INFO ZooKeeper: Client environment:user.home=/home/hadoop
17/08/02 20:21:31 INFO ZooKeeper: Client environment:user.dir=/home/hadoop/data
17/08/02 20:21:31 INFO ZooKeeper: Initiating client connection, connectString=thrift-internal.production.k8s.prod.node.io:2181 sessionTimeout=180000 watcher=org.apache.hadoop.hbase.zookeeper.PendingWatcher@2818bc0e
17/08/02 20:21:31 INFO ClientCnxn: Opening socket connection to server ip-172-23-115-152.us-west-2.compute.internal/172.23.115.152:2181. Will not attempt to authenticate using SASL (unknown error)

Solution

  • As an HBase client, you have to connect to both the HBase service (directly or through Thrift) and the ZooKeeper service (which usually runs on the same server as the HBase Master).

    When you connect to HBase through Thrift servers, the library uses the same host address to communicate with ZooKeeper.

    hbase = happybase.Connection(host, port=port, timeout=10000)
    

    However, this ZooKeeper address is not correct if the Thrift servers run on separate hardware/IPs.

    So, you have to connect to Thrift using the regular code

    hbase = happybase.Connection(host, port=port, timeout=10000)
    

    but specify HBaseHost (the ZooKeeper address) through the hbase.zookeeper.quorum parameter when you read a table:

    # HBaseHost is the ZooKeeper quorum address, not the Thrift host
    conf = {"hbase.zookeeper.quorum": HBaseHost, "hbase.mapreduce.inputtable": table}
    # keyConv and valueConv are the converter class names, defined elsewhere
    rdd = spark_context.newAPIHadoopRDD(
        "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
        "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "org.apache.hadoop.hbase.client.Result",
        keyConverter=keyConv,
        valueConverter=valueConv,
        conf=conf,
    )
    

    The ZooKeeper address can also be specified in hbase-site.xml as the hbase.zookeeper.quorum property. In that case, include this config file in your HBase client settings.
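If the quorum is already recorded in hbase-site.xml, one way to avoid hard-coding it is to read the property from the file and build the conf dict from it. The following is only a minimal sketch under assumptions: the helper names read_hadoop_property and build_input_conf, the sample XML, the host names, and the table name my_table are all hypothetical and not part of the original answer. It uses only the Python standard library and produces the dict that would be passed as conf to newAPIHadoopRDD.

```python
import xml.etree.ElementTree as ET


def read_hadoop_property(xml_text, name):
    """Return the value of a named property from Hadoop-style XML, or None."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            value = prop.findtext("value")
            return value.strip() if value else None
    return None


def build_input_conf(zookeeper_quorum, table):
    """Build the conf dict for TableInputFormat (hypothetical helper)."""
    return {
        "hbase.zookeeper.quorum": zookeeper_quorum,
        "hbase.mapreduce.inputtable": table,
    }


# Hypothetical hbase-site.xml content; in practice, read the real file from disk.
SAMPLE_HBASE_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>zk1.example.internal,zk2.example.internal</value>
  </property>
</configuration>
"""

# The quorum comes from the HBase cluster's config, not from the Thrift host.
quorum = read_hadoop_property(SAMPLE_HBASE_SITE, "hbase.zookeeper.quorum")
conf = build_input_conf(quorum, "my_table")
```

The resulting conf can then be passed to spark_context.newAPIHadoopRDD in place of the hand-written dict, while the happybase connection keeps pointing at the Thrift host.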