scala, apache-spark, hadoop, persist

Where is my sparkDF.persist(DISK_ONLY) data stored?


I want to understand more about how Spark persists data to disk when it runs on a Hadoop cluster.

When I persist a DataFrame with the DISK_ONLY strategy, where is my data stored (path/folder...)? And where do I specify this location?
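
For context, here is a minimal sketch of the kind of persist call I mean (the DataFrame is just a placeholder; my real data is much larger):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    val spark = SparkSession.builder().appName("persist-demo").getOrCreate()

    // Placeholder DataFrame standing in for my real, much larger data.
    val df = spark.range(0L, 1000000L).toDF("id")

    // Keep the data only on disk (no in-memory copy) and materialize it.
    // The question is which directories these on-disk blocks end up in.
    df.persist(StorageLevel.DISK_ONLY)
    df.count()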


Solution

  • To sum it up for my YARN environment:

    With the guidance of @stefanobaghino I was able to go one step further in the code, to the point where the YARN config is loaded:

    // in the Spark source: on YARN the executors' local dirs are read from the LOCAL_DIRS env var
    val localDirs = Option(conf.getenv("LOCAL_DIRS")).getOrElse("")
    

    which in turn is populated from the yarn.nodemanager.local-dirs option in yarn-default.xml.
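
    As a sanity check, something along these lines prints the directories the executors actually resolve (just a sketch, assuming a running SparkContext sc on YARN; the fallbacks cover other deployments):

    // Print the scratch directories each executor sees. On YARN these come
    // from the LOCAL_DIRS env var that the NodeManager sets per container;
    // elsewhere Spark falls back to SPARK_LOCAL_DIRS, spark.local.dir or
    // java.io.tmpdir.
    sc.parallelize(1 to 100, sc.defaultParallelism)
      .mapPartitions { _ =>
        Iterator(Option(System.getenv("LOCAL_DIRS"))
          .orElse(Option(System.getenv("SPARK_LOCAL_DIRS")))
          .getOrElse(System.getProperty("java.io.tmpdir")))
      }
      .collect()
      .distinct
      .foreach(println)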

    The background for my question is that my Spark job sometimes got killed by the error

    2018-01-23 16:57:35,229 WARN org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection: Directory /data/1/yarn/local error, used space above threshold of 98.5%, removing from list of valid directories
    

    and I'd like to understand whether this disk is also used for my persisted data while the job is running (which is actually a massive amount of data).

    So it turns out that this is exactly the folder the data goes to when it is persisted with a DISK strategy.
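
    For completeness: outside of YARN (e.g. local or standalone mode) the corresponding knob is spark.local.dir, which YARN ignores in favour of the LOCAL_DIRS directories above. A sketch with a hypothetical path:

    import org.apache.spark.sql.SparkSession

    // Non-YARN deployments: spark.local.dir controls where DISK_ONLY blocks
    // and other scratch files go. On YARN it is overridden by LOCAL_DIRS,
    // i.e. by yarn.nodemanager.local-dirs.
    val spark = SparkSession.builder()
      .appName("local-dir-demo")
      .config("spark.local.dir", "/mnt/big-disk/spark-scratch") // hypothetical path
      .getOrCreate()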

    Thanks a lot for all your helpful guidance on this problem!