apache-spark hadoop-yarn rdd

Spark RDD.pipe run bash script as a specific user


I notice that RDD.pipe(Seq("/tmp/test.sh")) runs the shell script as the user yarn. That is problematic because it allows the Spark user to access files that should only be accessible to the yarn user.

What is the best way to address this?
Calling sudo -u sparkuser is not a clean solution; I would hate to even consider it.


Solution

  • I am not sure whether Spark is at fault for treating pipe() differently, but I opened a similar issue on JIRA: https://issues.apache.org/jira/projects/SPARK/issues/SPARK-26101

    Now on to the problem. Apparently, on a YARN cluster the command passed to pipe() runs inside a YARN container. Whether your Hadoop is non-secure or secured by Kerberos determines whether that container runs as the user yarn/nobody or as the user who launched it, i.e. your actual user.

    Either secure your Hadoop with Kerberos or, if you don't want to go through that, set two configs in YARN that make it use the Linux users/groups to launch the container. Note that the same users/groups must exist on all the nodes in your cluster; otherwise this won't work (perhaps use LDAP/AD to sync your users/groups).

    Set these:

    yarn.nodemanager.linux-container-executor.nonsecure-mode.limit-users = false
    
    yarn.nodemanager.container-executor.class = org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor
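
    As a rough sketch, the two settings above could be written in yarn-site.xml as follows. The property names and values are the ones listed above; placing them inside the file's existing <configuration> element is the usual Hadoop convention, and the comment is only illustrative.

    ```xml
    <!-- yarn-site.xml fragment (sketch): place inside the existing <configuration> element -->
    <property>
      <!-- Launch containers through the Linux container executor -->
      <name>yarn.nodemanager.container-executor.class</name>
      <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
    </property>
    <property>
      <!-- In non-secure mode, do not force every container onto a single local user -->
      <name>yarn.nodemanager.linux-container-executor.nonsecure-mode.limit-users</name>
      <value>false</value>
    </property>
    ```

    The NodeManagers will most likely need a restart before the change takes effect.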
    

    Source: https://hadoop.apache.org/docs/r2.7.4/hadoop-yarn/hadoop-yarn-site/NodeManagerCgroups.html (this is the same even in Hadoop 3.0)

    This fix worked on the latest Cloudera CDH, 5.15.1 (yarn-site.xml): http://community.cloudera.com/t5/Batch-Processing-and-Workflow/YARN-force-nobody-user-on-all-jobs-and-so-they-fail/m-p/82572/highlight/true#M3882

    Example:

    val test = sc.parallelize(Seq("test user")).repartition(1)
    
    val piped = test.pipe(Seq("whoami"))
    
    val c = piped.collect()
    
    test: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[4] at repartition at <console>:25
    piped: org.apache.spark.rdd.RDD[String] = PipedRDD[5] at pipe at <console>:25
    c: Array[String] = Array(maziyar)
    

    This returns the name of the user who started the Spark session, once those configs are set in yarn-site.xml and all the users/groups are synced across all the nodes.