I need to find where Samza on YARN places its KV state stores. I suspect they live in the YARN local application directory, as with all YARN applications, but I believe the location is configurable: I did this a few months ago in a different environment (mapping the folder to memory) and no longer recall how.
To do that again, I need to be able to separate the Samza KV stores from the data of other YARN applications.
Here's the solution. It was printed in the Samza job log output:
[WARN] No override was provided for logged store base directory. This disables local state re-use on application restart. If you want to enable this feature, set LOGGED_STORE_BASE_DIR as an environment variable in all machines running the Samza container
LOGGED_STORE_BASE_DIR can be set as part of the NodeManager startup. For example:
# Typical environment setup.
export JAVA_HOME=...
export YARN_CONF_DIR=...
export YARN_LOG_DIR=...
export HADOOP_LOG_DIR=...
export YARN_MASTER=...
export YARN_PID_DIR=...
export YARN_IDENT_STRING=...
export YARN_NICENESS=...
export YARN_OPTS="-XX:+UseG1GC -XX:ErrorFile=logs/hs_err.log -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -Xloggc:logs/gc.log"
# Location of samza-kv stores for host affinity (should be on an SSD).
export LOGGED_STORE_BASE_DIR="/mnt/myssd/samza/logged-stores"
# Start the YARN NodeManager.
./yarn-daemon.sh --config "$YARN_CONF_DIR" start nodemanager
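Since the variable has to be set on every machine that runs a Samza container, it may help to check the directory before the NodeManager starts. The following is only a sketch: check_logged_store_dir is a hypothetical helper name, not part of Samza or YARN, and it assumes the directory should be created if it is missing.

```shell
#!/bin/sh
# Hypothetical helper (not provided by Samza or YARN): verify that the
# directory intended for LOGGED_STORE_BASE_DIR exists and is writable.
check_logged_store_dir() {
    dir="$1"
    if [ -z "$dir" ]; then
        echo "LOGGED_STORE_BASE_DIR is not set" >&2
        return 1
    fi
    # Create the directory (and any missing parents) if needed.
    mkdir -p "$dir" 2>/dev/null || {
        echo "cannot create $dir" >&2
        return 1
    }
    # The user running the NodeManager must be able to write here.
    if [ ! -w "$dir" ]; then
        echo "$dir is not writable" >&2
        return 1
    fi
    return 0
}
```

In the startup script this could be called right after the export, for example as check_logged_store_dir "$LOGGED_STORE_BASE_DIR" || exit 1, so that a misconfigured host fails early instead of silently falling back to the default location.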