Tags: hadoop, apache-spark, hadoop-yarn

Access a secured Hive when running Spark in an unsecured YARN cluster


We have two Cloudera 5.7.1 clusters, one secured using Kerberos and one unsecured.

Is it possible to run Spark on the unsecured YARN cluster while accessing Hive tables stored in the secured cluster? (The Spark version is 1.6.)

If so, can you please explain how I can get this configured?

Update:

I want to explain a little about the end goal behind my question. Our main, secured cluster is heavily utilized and our jobs can't get enough resources to complete in a reasonable time. To overcome this, we want to use resources from another, unsecured cluster we have, without needing to copy the data between the clusters.

We know it's not the best solution, as the data locality level might not be optimal, but it's the best solution we can come up with for now.

Please let me know if you have any other solution, as it seems we can't achieve the above.


Solution

  • If you run Spark in local mode, you can make it use an arbitrary set of Hadoop conf files -- i.e. core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml, hive-site.xml copied from the Kerberized cluster.
    So you can access HDFS on that cluster -- if you have a Kerberos ticket that grants you access to that cluster, of course.

      # point the Hadoop/Spark clients at the remote cluster's conf files
      export HADOOP_CONF_DIR=/path/to/conf/of/remote/kerberized/cluster
      # obtain a Kerberos ticket for that cluster (example principal)
      kinit sylvestre@WORLD.COMPANY
      # local mode: no YARN containers involved, only HDFS/Hive access
      spark-shell --master local[*]
    

    But in yarn-client or yarn-cluster mode, you cannot launch containers in the local cluster and access HDFS in the other.

    • either you use the local core-site.xml, which says that hadoop.security.authentication is simple, and you can connect to the local YARN/HDFS
    • or you point to a copy of the remote core-site.xml, which says that hadoop.security.authentication is kerberos, and you can connect to the remote YARN/HDFS (see the snippet after this list)
    • but you cannot use the local, unsecured YARN and access the remote, secure HDFS
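
    For reference, the switch lives in core-site.xml; a minimal sketch (the value shown is what the secured cluster's copy would contain):

      <!-- core-site.xml: selects the client's authentication model -->
      <property>
        <name>hadoop.security.authentication</name>
        <!-- "simple" on the unsecured cluster, "kerberos" on the secured one -->
        <value>kerberos</value>
      </property>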

    Note that with unsecured-unsecured or secure-secure combinations, you could access HDFS in another cluster by hacking a custom hdfs-site.xml to define multiple namespaces. But you are stuck with a single authentication model.
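
    A minimal sketch of that hdfs-site.xml hack, assuming a non-HA remote NameNode and placeholder host names (and remember, it only helps when both clusters use the same authentication model):

      <!-- hdfs-site.xml: declare the remote cluster as an extra nameservice -->
      <property>
        <name>dfs.nameservices</name>
        <value>localns,remotens</value>
      </property>
      <property>
        <!-- remote-nn.example.com is a placeholder for the other NameNode -->
        <name>dfs.namenode.rpc-address.remotens</name>
        <value>remote-nn.example.com:8020</value>
      </property>

    Paths on the other cluster should then be addressable as hdfs://remotens/...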
    [edit] See the comment by Mighty Steve Loughran about an extra Spark property for accessing remote, secure HDFS from a local, secure cluster.
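
    (That property is presumably spark.yarn.access.namenodes, which in Spark 1.x lists the extra secure namenodes Spark should fetch delegation tokens for; a sketch with a placeholder address:)

      # secure local YARN -> remote secure HDFS: fetch extra delegation tokens
      spark-shell --master yarn \
        --conf spark.yarn.access.namenodes=hdfs://remote-nn.example.com:8020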

    Note also that with DistCp you are stuck in the same way -- except that there's a "cheat" property that lets you go from secure to unsecured.
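
    That "cheat" is the ipc.client.fallback-to-simple-auth-allowed property: run DistCp from the secure cluster and let the IPC client fall back to simple auth against the unsecured side. Roughly (host names are placeholders):

      # run on the secure cluster, with a valid Kerberos ticket
      hadoop distcp \
        -D ipc.client.fallback-to-simple-auth-allowed=true \
        hdfs://secure-nn.example.com:8020/data/src \
        hdfs://unsecure-nn.example.com:8020/data/dst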