apache-spark hbase kerberos

java.io.IOException: Login failure for [email protected] from keytab


I wrote a Spark Streaming program that inserts data into a Kerberos-enabled HBase. In one batch, one task failed with the following error:

java.io.IOException: Login failure for [email protected] from keytab ./user.keytab
    at org.apache.hadoop.security.UserGroupInformation.loginUserFromKeytabAndReturnUGI(UserGroupInformation.java:1160)
    at com.framework.common.HbaseUtil$.InsertToHbase(HbaseUtil.scala:81)
    at com.framework.realtime.RDDUtil$$anonfun$dwsTodwd$2.apply(RDDUtil.scala:203)
    at com.framework.realtime.RDDUtil$$anonfun$dwsTodwd$2.apply(RDDUtil.scala:202)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: javax.security.auth.login.LoginException: Receive timed out
    at com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:767)
    at com.sun.security.auth.module.Krb5LoginModule.login(Krb5LoginModule.java:584)
    at sun.reflect.GeneratedMethodAccessor18.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at javax.security.auth.login.LoginContext.invoke(LoginContext.java:762)
    at javax.security.auth.login.LoginContext.access$000(LoginContext.java:203)
    at javax.security.auth.login.LoginContext$4.run(LoginContext.java:690)
    at javax.security.auth.login.LoginContext$4.run(LoginContext.java:688)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:687)
    at javax.security.auth.login.LoginContext.login(LoginContext.java:595)
    at org.apache.hadoop.security.UserGroupInformation.loginUserFromKeytabAndReturnUGI(UserGroupInformation.java:1149)
    ... 13 more
Caused by: java.net.SocketTimeoutException: Receive timed out
    at java.net.PlainDatagramSocketImpl.receive0(Native Method)
    at java.net.AbstractPlainDatagramSocketImpl.receive(AbstractPlainDatagramSocketImpl.java:146)
    at java.net.DatagramSocket.receive(DatagramSocket.java:816)
    at sun.security.krb5.internal.UDPClient.receive(NetClient.java:207)
    at sun.security.krb5.KdcComm$KdcCommunication.run(KdcComm.java:390)
    at sun.security.krb5.KdcComm$KdcCommunication.run(KdcComm.java:343)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.security.krb5.KdcComm.send(KdcComm.java:327)
    at sun.security.krb5.KdcComm.send(KdcComm.java:219)
    at sun.security.krb5.KdcComm.send(KdcComm.java:191)
    at sun.security.krb5.KrbAsReqBuilder.send(KrbAsReqBuilder.java:319)
    at sun.security.krb5.KrbAsReqBuilder.action(KrbAsReqBuilder.java:364)
    at com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:735)
    ... 25 more

But the task succeeded on the second attempt. My guess is that the authentication process took too long the first time, so it failed, and that it was faster on the retry, so it succeeded. Am I correct? Either way, how can I solve this problem? My code is below:

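import java.security.PrivilegedAction
import org.apache.hadoop.hbase.client.{HConnection, HConnectionManager, HTableInterface}
import org.apache.hadoop.security.UserGroupInformation

// Log in from the keytab; this call contacts the KDC and is where the timeout is thrown.
val ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(princ, keytab)

ugi.doAs(new PrivilegedAction[Unit]() {
  def run(): Unit = {
    var conn: HConnection = null
    var htable: HTableInterface = null
    try {
      conn = HConnectionManager.createConnection(conf)
      htable = conn.getTable(tableName)
      htable.setAutoFlushTo(false)
      for (record <- partitionOfRecords) {
        htable.put(record)
      }
      // Auto-flush is off, so push the buffered puts explicitly.
      htable.flushCommits()
    } finally {
      // Release the table and the connection even if a put fails.
      if (htable != null) htable.close()
      if (conn != null) conn.close()
    }
  }
})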

Solution

  • From "Hadoop and Kerberos: The Madness Beyond the Gate", chapter "Error Messages to Fear"...

    Receive timed out

    Usually in a stack trace like

    Caused by: java.net.SocketTimeoutException: Receive timed out
    at java.net.PlainDatagramSocketImpl.receive0(Native Method)
    ...
    at sun.security.krb5.internal.UDPClient.receive(NetClient.java:207)

    ... UDP socket ... Switch to TCP —at the very least, it will fail faster.

    And just above that:

    Switching kerberos to use TCP rather than UDP
    In /etc/krb5.conf:

    [libdefaults]
    udp_preference_limit = 1


    Generally speaking, many erratic Kerberos issues seem to occur only with UDP, so it's unfortunate that it's used by default...


    Note that Java also supports a kdc_timeout configuration parameter, but it's a dirty mess:

    • not mentioned in MIT Kerberos documentation
    • not mentioned in Unix/Linux documentation except for BSD
    • mentioned only in the darkest corners of the Java documentation (for Java 9, for example), with an interesting side note that the default value changed at some point from 30s expressed implicitly in milliseconds to 30s
    • a few weeks ago, the Cloudera support team issued a recommendation about that setting (because the 30s default timeout could create cascading failures in HDFS High Availability, or something like that), but the poor guys did not really know what they were recommending, so they suggested, seemingly at random, "3" or "3s" or "3000" for the explicit timeout value


    Note also that if you have multiple KDCs for high availability, and these KDCs are explicitly listed in krb5.conf (or implicitly listed via a DNS alias with a round-robin rule, for example), then in case of a "KDC timeout" Java should retry with the next KDC in line, unless a global timeout has already been reached.
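
    As a rough illustration of the two points above (preferring TCP and listing several KDCs explicitly), a krb5.conf could look something like the fragment below. The realm EXAMPLE.COM and the host names kdc1.example.com and kdc2.example.com are placeholders, not values from the question, and this is only a sketch of the relevant entries rather than a complete configuration.

    ```ini
    # Requests larger than this many bytes are sent over TCP first;
    # a value of 1 therefore means TCP is tried first for every request.
    [libdefaults]
        default_realm = EXAMPLE.COM
        udp_preference_limit = 1

    # Hypothetical realm with two KDCs listed explicitly; if the first one
    # times out, the client can move on to the next entry.
    [realms]
        EXAMPLE.COM = {
            kdc = kdc1.example.com
            kdc = kdc2.example.com
        }
    ```

    Because the failing login runs inside a Spark task, the file would need to be in place on every node that hosts an executor, not only on the machine that submits the job.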