In our Spark application, we store the local application cache in the `/mnt/yarn/app-cache/` directory, which is shared between app containers on the same EC2 instance. `/mnt/...` is chosen because it is a fast NVMe SSD on r5d instances.

This approach worked well for several years on EMR 5.x: `/mnt/yarn` belongs to the `yarn` user, app containers run as `yarn`, and so they can create directories there.
In EMR 6.x things changed: containers now run as the `hadoop` user, which does not have write access to `/mnt/yarn/`. The `hadoop` user can create directories in `/mnt/`, but `yarn` cannot, and I want to keep compatibility: the app should be able to run successfully on both EMR 5.x and 6.x.

`java.io.tmpdir` also doesn't work, because it is different for each container.
What is the proper place to store the cache on the NVMe SSDs (`/mnt`, `/mnt1`, ...) so that it is accessible by all containers and works on both EMR 5.x and 6.x?
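To make the compatibility requirement concrete, here is a minimal sketch of how an application could probe candidate cache locations at startup and fall back to a temp directory when none is writable. The candidate paths and the helper name are assumptions for illustration, not part of any EMR or Spark API:

```python
import os
import tempfile

# Candidate shared cache locations, in order of preference.
# /mnt/yarn/app-cache is writable on EMR 5.x (containers run as "yarn");
# a directory directly under /mnt may be creatable on EMR 6.x instead.
CANDIDATES = ["/mnt/yarn/app-cache", "/mnt/app-cache"]

def resolve_cache_dir(candidates=CANDIDATES):
    """Return the first candidate we can create and write to, else a temp dir."""
    for path in candidates:
        try:
            os.makedirs(path, exist_ok=True)
            # Verify we can actually write, not just that the dir exists.
            probe = os.path.join(path, ".write-probe")
            with open(probe, "w") as f:
                f.write("ok")
            os.remove(probe)
            return path
        except OSError:
            continue  # no permission here; try the next location
    # Last resort: a per-process temp dir. Note this loses the
    # cross-container sharing, as described for java.io.tmpdir above.
    return tempfile.mkdtemp(prefix="app-cache-")
```

The fallback keeps the app running on both EMR versions, but only the `/mnt`-based candidates actually give the shared-cache behavior the question is after.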
On your EMR cluster, you can add the `yarn` user to the HDFS superuser group; by default, this group is called `supergroup`. You can confirm whether this is the right group by checking `dfs.permissions.superusergroup` in the `hdfs-site.xml` file.
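If your cluster is on the defaults, the relevant entry in `hdfs-site.xml` looks roughly like this (the value shown is the stock default, not necessarily what your cluster uses, so check the file rather than assuming):

```xml
<property>
  <name>dfs.permissions.superusergroup</name>
  <value>supergroup</value>
</property>
```

Assuming the group also exists as a local OS group on the nodes, adding the user would then be along the lines of `sudo usermod -a -G supergroup yarn`.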
You could also try modifying the following HDFS properties (in the same file): `dfs.permissions.enabled` or `dfs.datanode.data.dir.perm`.
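For reference, a sketch of what changing those settings could look like in `hdfs-site.xml`; the values here are illustrative assumptions, and note that setting `dfs.permissions.enabled` to `false` disables HDFS permission checking entirely, with obvious security implications:

```xml
<property>
  <name>dfs.permissions.enabled</name>
  <value>false</value>
</property>
<property>
  <name>dfs.datanode.data.dir.perm</name>
  <value>775</value>
</property>
```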