In our Spark application, we store the local application cache in the `/mnt/yarn/app-cache/` directory, which is shared between app containers on the same EC2 instance. `/mnt/...` is chosen because it is a fast NVMe SSD on r5d instances.

This approach worked well for several years on EMR 5.x: `/mnt/yarn` belongs to the `yarn` user, app containers run as `yarn`, and so they can create directories there.
In EMR 6.x things changed: containers now run as the `hadoop` user, which does not have write access to `/mnt/yarn/`. The `hadoop` user can create directories in `/mnt/`, but `yarn` cannot, and I want to keep compatibility: the app should be able to run successfully on both EMR 5.x and 6.x.

`java.io.tmpdir` also doesn't work, because it is different for each container.
What is the proper place to store the cache on the NVMe SSDs (`/mnt`, `/mnt1`, ...) so that it is accessible by all containers and works on both EMR 5.x and 6.x?
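To make the compatibility requirement concrete, here is a minimal sketch of how an application could probe candidate cache locations at startup and fall back to a temp directory when none is writable. The candidate paths and the helper name are assumptions for illustration, not part of any EMR or Spark API:

```python
import os
import tempfile

# Candidate shared cache locations, in order of preference.
# /mnt/yarn/app-cache is writable on EMR 5.x (containers run as "yarn");
# a directory directly under /mnt may be creatable on EMR 6.x instead.
CANDIDATES = ["/mnt/yarn/app-cache", "/mnt/app-cache"]

def resolve_cache_dir(candidates=CANDIDATES):
    """Return the first candidate we can create and write to, else a temp dir."""
    for path in candidates:
        try:
            os.makedirs(path, exist_ok=True)
            # Verify we can actually write, not just that the dir exists.
            probe = os.path.join(path, ".write-probe")
            with open(probe, "w") as f:
                f.write("ok")
            os.remove(probe)
            return path
        except OSError:
            continue  # no permission here; try the next location
    # Last resort: a per-process temp dir. Note this loses the
    # cross-container sharing, as described for java.io.tmpdir above.
    return tempfile.mkdtemp(prefix="app-cache-")
```

The fallback keeps the app running on both EMR versions, but only the `/mnt`-based candidates actually give the shared-cache behavior the question is after.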
On your EMR cluster, you can add the `yarn` user to the HDFS superuser group; by default, this group is called `supergroup`. You can confirm whether this is the right group by checking `dfs.permissions.superusergroup` in the `hdfs-site.xml` file.
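If your cluster is on the defaults, the relevant entry in `hdfs-site.xml` looks roughly like this (the value shown is the stock default, not necessarily what your cluster uses, so check the file rather than assuming):

```xml
<property>
  <name>dfs.permissions.superusergroup</name>
  <value>supergroup</value>
</property>
```

Assuming the group also exists as a local OS group on the nodes, adding the user would then be along the lines of `sudo usermod -a -G supergroup yarn`.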
You could also try modifying the following HDFS properties (in the same file): `dfs.permissions.enabled` or `dfs.datanode.data.dir.perm`.
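For reference, a sketch of what changing those settings could look like in `hdfs-site.xml`; the values here are illustrative assumptions, and note that setting `dfs.permissions.enabled` to `false` disables HDFS permission checking entirely, with obvious security implications:

```xml
<property>
  <name>dfs.permissions.enabled</name>
  <value>false</value>
</property>
<property>
  <name>dfs.datanode.data.dir.perm</name>
  <value>775</value>
</property>
```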