apache-samza

Where does Samza on YARN place its KV state stores?


I need to find where Samza on YARN places its KV state stores. I suspect they go in the YARN local application directory, as with all YARN applications, but I believe the location is configurable: I did this a few months ago in a different environment (mapping the folder to memory) and no longer recall how.

To do that again, I need to be able to separate the Samza KV stores from the data of other YARN applications.


Solution

  • Here's the solution. It was printed in the Samza job log output:

    [WARN] No override was provided for logged store base directory. This disables local state re-use on application restart. If you want to enable this feature, set LOGGED_STORE_BASE_DIR as an environment variable in all machines running the Samza container

    LOGGED_STORE_BASE_DIR can be set as part of the NodeManager startup. For example:

    # Typical environment setup.
    export JAVA_HOME=...
    export YARN_CONF_DIR=...
    export YARN_LOG_DIR=...
    export HADOOP_LOG_DIR=...
    export YARN_MASTER=...
    export YARN_PID_DIR=...
    export YARN_IDENT_STRING=...
    export YARN_NICENESS=...
    export YARN_OPTS="-XX:+UseG1GC -XX:ErrorFile=logs/hs_err.log -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -Xloggc:logs/gc.log"
    
    # Location of samza-kv stores for host affinity (should be on an SSD).
    export LOGGED_STORE_BASE_DIR="/mnt/myssd/samza/logged-stores"
    
    # Start up the YARN NodeManager.
    ./yarn-daemon.sh --config "$YARN_CONF_DIR" start nodemanager
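
Before exporting LOGGED_STORE_BASE_DIR it may be worth checking that the directory exists and is writable by the user that runs the containers. The following is only a minimal sketch: `prepare_store_dir` is a hypothetical helper (not part of Samza or YARN), and the scratch path under `/tmp` is a placeholder chosen so the snippet can run anywhere; substitute your real store location.

```shell
#!/bin/sh
# Hypothetical helper (not part of Samza or YARN): ensure the store
# directory exists and is writable, then print its path on stdout.
prepare_store_dir() {
    dir="$1"
    if [ -z "$dir" ]; then
        echo "usage: prepare_store_dir <directory>" >&2
        return 2
    fi
    # Create the directory (and any missing parents).
    mkdir -p "$dir" || return 1
    # The current user must be able to write to it.
    if [ ! -w "$dir" ]; then
        echo "not writable: $dir" >&2
        return 1
    fi
    printf '%s\n' "$dir"
}

# Placeholder path (an assumption for illustration only).
STORE_DIR="${TMPDIR:-/tmp}/samza-logged-stores-demo"

# Export the variable only if the directory check succeeded.
if LOGGED_STORE_BASE_DIR="$(prepare_store_dir "$STORE_DIR")"; then
    export LOGGED_STORE_BASE_DIR
    echo "LOGGED_STORE_BASE_DIR=$LOGGED_STORE_BASE_DIR"
fi
```

Placing a check like this ahead of the NodeManager startup line makes a missing or read-only mount show up at startup rather than later in the job log.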