I notice that RDD.pipe(Seq("/tmp/test.sh")) runs the shell script as the user yarn. That is problematic because it allows the spark user to access files that should only be accessible to the yarn user.
What is the best way to address this?
Calling sudo -u sparkuser is not a clean solution; I would hate to even consider that.
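For reference, here is a minimal sketch of how I am reproducing this in spark-shell (the variable name is my own; this assumes a nonsecure YARN cluster):

// Pipe a partition through an external `whoami` to see which OS user runs it.
// On a nonsecure YARN cluster this comes back as "yarn", not the submitting user.
val users = sc.parallelize(Seq("check"))
  .repartition(1)
  .pipe(Seq("whoami"))
  .collect()
// users: Array[String] = Array(yarn)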
I am not sure whether it is Spark's fault that pipe() is treated differently, but I opened a similar issue on JIRA: https://issues.apache.org/jira/projects/SPARK/issues/SPARK-26101
Now on to the problem. Apparently, on a YARN cluster, Spark's pipe() asks for a container, and whether your Hadoop is nonsecure or secured by Kerberos determines whether that container runs as the yarn/nobody user or as the user who launched the container (your actual user).
Either use Kerberos to secure your Hadoop, or, if you don't want to go through securing it, set two configs in YARN so that it uses the Linux users/groups to launch containers. Note that you must share the same users/groups across all the nodes in your cluster; otherwise this won't work (perhaps use LDAP/AD to sync your users/groups).
Set these:
yarn.nodemanager.linux-container-executor.nonsecure-mode.limit-users = false
yarn.nodemanager.container-executor.class = org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor
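For reference, a sketch of how those two properties would look in yarn-site.xml (same keys and values as above; restart the NodeManagers after changing them):

<!-- yarn-site.xml: launch containers as the submitting Linux user, even without Kerberos -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.nonsecure-mode.limit-users</name>
  <value>false</value>
</property>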
Source: https://hadoop.apache.org/docs/r2.7.4/hadoop-yarn/hadoop-yarn-site/NodeManagerCgroups.html (this also applies in Hadoop 3.0)
This fix worked on the latest Cloudera CDH 5.15.1 (set in yarn-site.xml): http://community.cloudera.com/t5/Batch-Processing-and-Workflow/YARN-force-nobody-user-on-all-jobs-and-so-they-fail/m-p/82572/highlight/true#M3882
Example:
val test = sc.parallelize(Seq("test user")).repartition(1)
val piped = test.pipe(Seq("whoami"))
val c = piped.collect()
test: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[4] at repartition at <console>:25
piped: org.apache.spark.rdd.RDD[String] = PipedRDD[5] at pipe at <console>:25
c: Array[String] = Array(maziyar)
After setting those configs in yarn-site.xml and syncing all the users/groups among all the nodes, this returns the username of the user who started the Spark session.