
HBase establishes a session with ZooKeeper and closes it immediately


I have found that our RegionServers connect to ZooKeeper very frequently. They seem to constantly establish a session, close it, and then reconnect to ZooKeeper. The logs from both the client and the server side are below. I have no idea why this happens or how to deal with it. We're using HBase 0.94.11 and ZooKeeper 3.4.4.

The log from the HBase RegionServer:

2014-09-18,16:38:17,867 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=10.2.201.74:11000,10.2.201.73:11000,10.101.10.67:11000,10.101.10.66:11000,10.2.201.75:11000 sessionTimeout=30000 watcher=catalogtracker-on-org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@69d892a1
2014-09-18,16:38:17,868 INFO org.apache.zookeeper.client.ZooKeeperSaslClient: Client will use GSSAPI as SASL mechanism.
2014-09-18,16:38:17,868 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server lg-hadoop-srv-ct01.bj/10.2.201.73:11000. Will attempt to SASL-authenticate using Login Context section 'Client'
2014-09-18,16:38:17,868 INFO org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: The identifier of this process is [email protected]
2014-09-18,16:38:17,868 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to lg-hadoop-srv-ct01.bj/10.2.201.73:11000, initiating session
2014-09-18,16:38:17,870 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server lg-hadoop-srv-ct01.bj/10.2.201.73:11000, sessionid = 0x248782700e52b3c, negotiated timeout = 30000
2014-09-18,16:38:17,876 INFO org.apache.zookeeper.ZooKeeper: Session: 0x248782700e52b3c closed
2014-09-18,16:38:17,876 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
2014-09-18,16:38:17,878 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSink: Total replicated: 24

The log from its ZooKeeper server:

2014-09-18,16:38:17,869 INFO org.apache.zookeeper.server.NIOServerCnxnFactory: [myid:2] Accepted socket connection from /10.2.201.76:55621
2014-09-18,16:38:17,869 INFO org.apache.zookeeper.server.ZooKeeperServer: [myid:2] Client attempting to establish new session at /10.2.201.76:55621
2014-09-18,16:38:17,870 INFO org.apache.zookeeper.server.ZooKeeperServer: [myid:2] Established session 0x248782700e52b3c with negotiated timeout 30000 for client /10.2.201.76:55621
2014-09-18,16:38:17,872 INFO org.apache.zookeeper.server.auth.SaslServerCallbackHandler: [myid:2] Successfully authenticated client: authenticationID=hbase_srv/[email protected];  authorizationID=hbase_srv/[email protected].
2014-09-18,16:38:17,872 INFO org.apache.zookeeper.server.auth.SaslServerCallbackHandler: [myid:2] Setting authorizedID: hbase_srv
2014-09-18,16:38:17,872 INFO org.apache.zookeeper.server.ZooKeeperServer: [myid:2] adding SASL authorization for authorizationID: hbase_srv
2014-09-18,16:38:17,877 INFO org.apache.zookeeper.server.NIOServerCnxn: [myid:2] Closed socket connection for client /10.2.201.76:55621 which had sessionid 0x248782700e52b3c

Solution

  • I finally found the root cause.

    It is related to ReplicationSink. I found this line in the log: "2014-09-23,14:58:01,736 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSink: Replicating for table online_miliao_recent".

    I then looked at the relevant code and found that every call to replicateEntries() also invokes sharedHBaseAdmin.tableExists(table).

    sharedHBaseAdmin.tableExists() creates a new CatalogTracker object, which is itself a ZooKeeper client.

    When the method returns, it cleans up that ZooKeeper client and closes its session.

    So these log entries are expected whenever replication is running. Still, tableExists() is fairly heavy, and I don't think it should be invoked every time entries are replicated. I also noticed that CatalogTracker is no longer used in ReplicationSink after 0.94.11, so this is not a problem in later versions.

    It would be great if I could find the JIRA issue that removed CatalogTracker from ReplicationSink :-)
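
The pattern described above can be illustrated with a small, self-contained sketch. It does not use the real HBase or ZooKeeper classes: FakeZkSession is a hypothetical stand-in for a ZooKeeper-backed client such as CatalogTracker, and tableExistsCached() is only one possible way to avoid the repeated sessions, not necessarily what later HBase versions do.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;

public class TableExistsSketch {

    /** Hypothetical stand-in for a ZooKeeper-backed client (e.g. CatalogTracker). */
    static class FakeZkSession implements AutoCloseable {
        static final AtomicInteger opened = new AtomicInteger();
        static final AtomicInteger closed = new AtomicInteger();

        FakeZkSession() {
            opened.incrementAndGet(); // corresponds to "Session establishment complete"
        }

        // Assumption for the sketch: only this one table exists.
        boolean tableExists(String table) {
            return "online_miliao_recent".equals(table);
        }

        @Override
        public void close() {
            closed.incrementAndGet(); // corresponds to "Session: ... closed"
        }
    }

    /** The behavior described in the answer: one short-lived session per check. */
    static boolean tableExistsPerCall(String table) {
        try (FakeZkSession session = new FakeZkSession()) {
            return session.tableExists(table);
        }
    }

    /** Tables already confirmed to exist (not thread-safe; sketch only). */
    static final Set<String> knownTables = new HashSet<>();

    /** Possible mitigation: remember positive results and skip the session. */
    static boolean tableExistsCached(String table) {
        if (knownTables.contains(table)) {
            return true;
        }
        boolean exists = tableExistsPerCall(table);
        if (exists) {
            knownTables.add(table);
        }
        return exists;
    }

    public static void main(String[] args) {
        // Simulate a batch of replicated entries for the same table.
        for (int i = 0; i < 24; i++) {
            tableExistsPerCall("online_miliao_recent");
        }
        System.out.println("sessions opened, per-call: " + FakeZkSession.opened.get());

        int before = FakeZkSession.opened.get();
        for (int i = 0; i < 24; i++) {
            tableExistsCached("online_miliao_recent");
        }
        System.out.println("sessions opened, cached: " + (FakeZkSession.opened.get() - before));
    }
}
```

A cache like this trades freshness for fewer sessions: if a table is dropped on the sink cluster, the cached positive result becomes stale, so a real implementation would need some form of invalidation.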