Tags: asynchronous, hive, thrift, execution, hiveql

Asynchronous Hive query execution: OperationHandle gets cleaned up on the server side as soon as the client that initiated the query disconnects


Is it possible to execute a query asynchronously on a Hive server?

For example, is it possible to do something like this from the client, and if so, how?

QueryHandle handle = executeAsyncQuery(hiveQuery);
Status status = handle.checkStatus();
if(status.isCompleted()) {
    QueryResult result = handle.fetchResult();
}
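
To make the desired shape concrete, here is a minimal in-process sketch of the submit / poll / fetch pattern built on `CompletableFuture`. It does not talk to Hive at all: `QueryHandle`, `executeAsyncQuery` and the fake runner are hypothetical names used only for illustration, and a handle like this lives only as long as the JVM that created it.

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

// Hypothetical in-process handle; not part of any Hive API.
final class QueryHandle {
    private final CompletableFuture<List<String>> future;

    QueryHandle(CompletableFuture<List<String>> future) {
        this.future = future;
    }

    // Non-blocking status check.
    boolean isCompleted() {
        return future.isDone();
    }

    // Blocks until the result is ready; a failed query surfaces as CompletionException.
    List<String> fetchResult() {
        return future.join();
    }
}

final class AsyncQuerySketch {
    // Starts the query on a background thread and returns immediately.
    static QueryHandle executeAsyncQuery(String query, Function<String, List<String>> runner) {
        return new QueryHandle(CompletableFuture.supplyAsync(() -> runner.apply(query)));
    }

    public static void main(String[] args) throws InterruptedException {
        // Stand-in for a real query executor.
        Function<String, List<String>> fakeRunner =
                q -> Collections.singletonList("row for: " + q);
        QueryHandle handle = executeAsyncQuery("SELECT 1", fakeRunner);
        while (!handle.isCompleted()) {
            Thread.sleep(10); // poll instead of blocking
        }
        System.out.println(handle.fetchResult());
    }
}
```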

I also looked at "How do I make an async call to Hive in Java?", but it did not help: the answers were mostly about the Thrift clients taking a callback argument.

Any help would be appreciated. Thanks!

[EDIT 1]

I went through HiveConnection.java in hive-jdbc. By default, hive-jdbc uses the async Thrift APIs: it submits a query and then polls for result sets (see HiveStatement.java). I am now able to write code that is purely non-blocking. The problem is that as soon as the client disconnects, every trace of the query is lost on the server.

Client 1

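final TCLIService.Client client = new TCLIService.Client(createBinaryTransport(host, port, loginTimeout, sessConf, false)); // from HiveConnection.java
TSessionHandle sessionHandle = openSession(client); // from HiveConnection.java
TExecuteStatementReq execReq = new TExecuteStatementReq(sessionHandle, sql);
execReq.setRunAsync(true); // return immediately instead of waiting for the query to finish
execReq.setConfOverlay(sessConf);
// ExecuteStatement returns a response; wrap its operation handle in a status request
final TExecuteStatementResp execResp = client.ExecuteStatement(execReq);
final TGetOperationStatusReq handle = new TGetOperationStatusReq(execResp.getOperationHandle());
writeHandleToFile("~/handle", handle); // helper not shown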

Client 2

final TGetOperationStatusReq handle = readHandleFromFile("~/handle"); // helper not shown
final TCLIService.Client client = new TCLIService.Client(createBinaryTransport(host, port, loginTimeout, sessConf, false));
while (true) {
    System.out.println(client.GetOperationStatus(handle).getOperationState());
    Thread.sleep(1000);
}
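
The two file helpers used above are not shown in the original snippets. One possible sketch, assuming the handle object is `java.io.Serializable` (Thrift's Java generator normally emits classes that are, but check your generated `TGetOperationStatusReq`), uses plain Java serialization. The class name `HandleFiles` and the `resolve` helper are made up for this example. Note that Java does not expand `~` on its own, so the sketch resolves it against `user.home`.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helpers; the names mirror the calls in the snippets above.
final class HandleFiles {
    // Java does not expand "~", so map a leading "~/" to the user's home directory.
    static Path resolve(String path) {
        if (path.startsWith("~/")) {
            return Paths.get(System.getProperty("user.home"), path.substring(2));
        }
        return Paths.get(path);
    }

    // Serializes the handle to the given file, replacing any previous content.
    static void writeHandleToFile(String path, Serializable handle) throws IOException {
        try (ObjectOutputStream out =
                new ObjectOutputStream(Files.newOutputStream(resolve(path)))) {
            out.writeObject(handle);
        }
    }

    // Reads back whatever object was written; the caller picks the expected type.
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T readHandleFromFile(String path)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                new ObjectInputStream(Files.newInputStream(resolve(path)))) {
            return (T) in.readObject();
        }
    }
}
```

Persisting the handle this way only preserves the identifier; it does nothing to keep the operation alive on the server.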

Client 2 keeps printing FINISHED_STATE as long as Client 1 is alive. But if the Client 1 process completes or gets killed, Client 2 starts printing null, which suggests that HiveServer2 cleans up the resources as soon as the client disconnects.

Is it possible to configure HiveServer2 so that this cleanup is based on time or some other criterion?
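
As an aside, the `while (true)` loop in Client 2 never terminates. A bounded variant could stop on a terminal state or on the null state described above. The sketch below is decoupled from Thrift: the status source is a plain `Supplier<String>` that would wrap the `GetOperationStatus` call, and `StatusPoller`, `pollUntilDone` and the choice of terminal states are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical helper; state names follow the TOperationState values seen above.
final class StatusPoller {
    // States assumed to be terminal for the purpose of this sketch.
    static final Set<String> TERMINAL = new HashSet<>(Arrays.asList(
            "FINISHED_STATE", "CANCELED_STATE", "CLOSED_STATE", "ERROR_STATE"));

    // Polls until a terminal state, a null state (handle no longer known to the
    // server), or maxPolls attempts. Returns the last observed state, possibly null.
    static String pollUntilDone(Supplier<String> status, long intervalMillis, int maxPolls)
            throws InterruptedException {
        String state = null;
        for (int i = 0; i < maxPolls; i++) {
            state = status.get();
            if (state == null || TERMINAL.contains(state)) {
                return state;
            }
            Thread.sleep(intervalMillis);
        }
        return state;
    }
}
```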

Thanks!


Solution

  • I did some research and found that this happens only with the binary (TCP) transport:

      @Override
      public void deleteContext(ServerContext serverContext,
          TProtocol input, TProtocol output) {
        Metrics metrics = MetricsFactory.getInstance();
        if (metrics != null) {
          try {
            metrics.decrementCounter(MetricsConstant.OPEN_CONNECTIONS);
          } catch (Exception e) {
            LOG.warn("Error Reporting JDO operation to Metrics system", e);
          }
        }
        ThriftCLIServerContext context = (ThriftCLIServerContext) serverContext;
        SessionHandle sessionHandle = context.getSessionHandle();
        if (sessionHandle != null) {
          LOG.info("Session disconnected without closing properly, close it now");
          try {
            cliService.closeSession(sessionHandle);
          } catch (HiveSQLException e) {
            LOG.warn("Failed to close session: " + e, e);
          }
        }
      }
    

    The method above (from ThriftBinaryCLIService) is invoked by the following line in TThreadPoolServer, which ThriftBinaryCLIService uses:

    eventHandler.deleteContext(connectionContext, inputProtocol, outputProtocol);

    Apparently the HTTP transport (ThriftHttpCLIService) uses a different strategy for cleaning up operation handles (not greedy like TCP).

    I will check with the Hive community to understand this a bit more and see whether there is already an issue addressing it.
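
To illustrate why a dropped connection makes the operation disappear, here is a toy model of the ownership chain implied by the code above: connection teardown closes the session, and closing a session discards the operations it owns. This is not Hive's code. `ToySessionServer` and its methods are invented for the example, and the model assumes operations are tracked per session.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model only: it mimics the ownership relation, not Hive's real classes.
final class ToySessionServer {
    // Each session owns the operations started through it.
    private final Map<String, Set<String>> operationsBySession = new HashMap<>();

    void openSession(String sessionId) {
        operationsBySession.put(sessionId, new HashSet<>());
    }

    void startOperation(String sessionId, String operationId) {
        operationsBySession.get(sessionId).add(operationId);
    }

    // Returns a state for known operations and null once the owning session is gone.
    String getOperationState(String operationId) {
        for (Set<String> ops : operationsBySession.values()) {
            if (ops.contains(operationId)) {
                return "FINISHED_STATE";
            }
        }
        return null;
    }

    // Closing a session discards every operation it owns.
    void closeSession(String sessionId) {
        operationsBySession.remove(sessionId);
    }

    // Binary-transport analogue of deleteContext: a dropped connection closes its session.
    void onConnectionClosed(String sessionId) {
        closeSession(sessionId);
    }

    public static void main(String[] args) {
        ToySessionServer server = new ToySessionServer();
        server.openSession("client-1");
        server.startOperation("client-1", "op-1");
        System.out.println(server.getOperationState("op-1")); // FINISHED_STATE
        server.onConnectionClosed("client-1");
        System.out.println(server.getOperationState("op-1")); // null
    }
}
```

In this model the only ways to keep `op-1` visible to a second client are to keep the first connection open or to decouple session lifetime from connection lifetime, which is what the HTTP transport appears to do.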