Tags: apache-spark, hadoop, pyspark, apache-spark-sql, hadoop-yarn

Where are the Spark intermediate files stored on disk?


During a shuffle, the mappers write their outputs to the local disk, from where they are picked up by the reducers. Where exactly on the disk are those files written? I am running a PySpark cluster on YARN.

What I have tried so far:

I think the possible locations where the intermediate files could be are the following, in decreasing order of likelihood (a quick programmatic check of these settings is sketched right after this list):

  1. hadoop/spark/tmp. As per the documentation, this should be pointed to by the LOCAL_DIRS env variable that YARN defines. However, after starting the cluster (I am passing --master yarn) I couldn't find any LOCAL_DIRS env variable using os.environ, but I can see SPARK_LOCAL_DIRS, which according to the documentation should only happen in the case of Mesos or standalone (any idea why that might be the case?). Anyhow, my SPARK_LOCAL_DIRS is hadoop/spark/tmp.
  2. /tmp. The default value of spark.local.dir.
  3. /home/username. I have tried passing a custom value to spark.local.dir when starting pyspark, using --conf spark.local.dir=/home/username.
  4. hadoop/yarn/nm-local-dir. This is the value of the yarn.nodemanager.local-dirs property in yarn-site.xml.
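
To double-check these settings rather than guess at them, the values can be read programmatically. A minimal sketch, assuming an active SparkSession named spark (on YARN it is the executors' environment, not the driver's, that decides where shuffle files land):

import os

# Driver-side view: the Spark conf and the env variables Spark consults.
print("spark.local.dir  :", spark.conf.get("spark.local.dir", "<not set>"))
print("SPARK_LOCAL_DIRS :", os.environ.get("SPARK_LOCAL_DIRS"))
print("LOCAL_DIRS (YARN):", os.environ.get("LOCAL_DIRS"))

# Executor-side view: run the same check from inside a task.
def local_dirs_on_executor(_):
    import os
    return os.environ.get("LOCAL_DIRS"), os.environ.get("SPARK_LOCAL_DIRS")

print(spark.sparkContext.parallelize(range(1), 1).map(local_dirs_on_executor).collect())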

I am running the following code and then checking whether any intermediate files show up at the above 4 locations, by navigating to each of them on a worker node.

The code I am running:

from pyspark import StorageLevel

# Read both tables, join them, and force the joined result onto local disk.
df_sales = spark.read.load("gs://monsoon-credittech.appspot.com/spark_datasets/sales_parquet")
df_products = spark.read.load("gs://monsoon-credittech.appspot.com/spark_datasets/products_parquet")
df_merged = df_sales.join(df_products, df_sales.product_id == df_products.product_id, 'inner')
df_merged.persist(StorageLevel.DISK_ONLY)  # DISK_ONLY should spill blocks to the local dirs
df_merged.count()  # action that materializes the join and the persisted blocks

No files are being created at any of the 4 locations listed above.

As suggested in one of the answers, I have tried getting the directory info in the terminal the following way:

  1. At the end of the log4j.properties file located at $SPARK_HOME/conf/, add log4j.logger.org.apache.spark.api.python.PythonGatewayServer=INFO. This did not help. The following is a screenshot of my terminal with logging set to INFO:

[screenshot: terminal output with logging set to INFO]
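
An alternative that avoids editing log4j.properties (a sketch, assuming an active SparkSession named spark): raise the log level at runtime. This surfaces driver-side DiskBlockManager INFO lines like the one quoted in the answer below; executor-side lines go to the YARN container logs instead.

# Runtime alternative to editing log4j.properties.
spark.sparkContext.setLogLevel("INFO")

# Executor logs live in the YARN container logs; they can be fetched with:
#   yarn logs -applicationId <application_xxxxxxxxxxx_xxxxxxx>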

Where are the Spark intermediate files (mapper output, persisted data, etc.) stored?


Solution

  • Without getting into the weeds of Spark source, perhaps you can quickly check it live. Something like this:

    >>> irdd = spark.sparkContext.range(0,100,1,10)                                                                                                          
    >>> def wherearemydirs(p):
    ...   import os
    ...   return os.getenv('LOCAL_DIRS')                                                                                                
    ... 
    >>> 
    >>> irdd.map(wherearemydirs).collect()
    >>>
    

    ...will show the local dirs in the terminal:

    /data/1/yarn/nm/usercache//appcache/<application_xxxxxxxxxxx_xxxxxxx>,/data/10/yarn/nm/usercache//appcache/<application_xxxxxxxxxxx_xxxxxxx>,/data/11/yarn/nm/usercache//appcache/<application_xxxxxxxxxxx_xxxxxxx>,...

    But yes, it will basically point to the parent dir (created by YARN) of UUID-randomized subdirs created by DiskBlockManager, as @KoedIt mentioned:

    :
    23/01/05 10:15:37 INFO storage.DiskBlockManager: Created local directory at /data/1/yarn/nm/usercache/<your-user-id>/appcache/application_xxxxxxxxx_xxxxxxx/blockmgr-d4df4512-d18b-4dcf-8197-4dfe781b526a
    :
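
    Since those blockmgr-* directories exist only for the lifetime of the application (YARN removes the appcache directory once the application finishes), any inspection has to happen from a live session. A rough sketch along the same lines as the snippet above, assuming a running SparkSession named spark on YARN:

    # Glob for DiskBlockManager's blockmgr-* subdirectories on one executor.
    def list_blockmgr_dirs(_):
        import glob, os
        found = []
        for d in os.environ.get("LOCAL_DIRS", "").split(","):
            found.extend(glob.glob(os.path.join(d, "blockmgr-*")))
        return found

    print(spark.sparkContext.parallelize(range(1), 1).map(list_blockmgr_dirs).collect())

    One partition keeps the output readable; using more partitions than executors would sample the local dirs of several nodes instead.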