During a shuffle, the mappers dump their outputs to the local disk from where it gets picked up by the reducers. Where exactly on the disk are those files dumped? I am running pyspark cluster on YARN.
What I have tried so far:
I think the possible locations where the intermediate files could be are (In the decreasing order of likelihood):
hadoop/spark/tmp
. As per the documentation at the LOCAL_DIRS
env variable that gets defined by the yarn.
However, post starting the cluster (I am passing master --yarn
) I couldn't find any LOCAL_DIRS
env variable using os.environ
but, I can see SPARK_LOCAL_DIRS
which should happen only in case of mesos or standalone as per the documentation (Any idea why that might be the case?). Anyhow, my SPARK_LOCAL_DIRS
is hadoop/spark/tmp
tmp
. Default value of spark.local.dir
/home/username
. I have tried sending custom value to spark.local.dir
while starting the pyspark using --conf spark.local.dir=/home/username
hadoop/yarn/nm-local-dir
. This is the value of yarn.nodemanager.local-dirs
property in yarn-site.xmlI am running the following code and checking for any intermediate files being created at the above 4 locations by navigating to each location on a worker node.
The code I am running:
from pyspark import storagelevel
df_sales = spark.read.load("gs://monsoon-credittech.appspot.com/spark_datasets/sales_parquet")
df_products = spark.read.load("gs://monsoon-credittech.appspot.com/spark_datasets/products_parquet")
df_merged = df_sales.join(df_products,df_sales.product_id==df_products.product_id,'inner')
df_merged.persist(storagelevel.StorageLevel.DISK_ONLY)
df_merged.count()
There are no files that are being created at any of the 4 locations that I have listed above
As suggested in one of the answers, I have tried getting the directory info in the terminal the following way:
$SPARK_HOME/conf/
add log4j.logger.or.apache.spark.api.python.PythonGatewayServer=INFO
This did not help. The following is the screenshot of my terminal with logging set to INFOWhere are the spark intermediate files (output of mappers, persist etc) stored?
Without getting into the weeds of Spark source, perhaps you can quickly check it live. Something like this:
>>> irdd = spark.sparkContext.range(0,100,1,10)
>>> def wherearemydirs(p):
... import os
... return os.getenv('LOCAL_DIRS')
...
>>>
>>> irdd.map(wherearemydirs).collect()
>>>
...will show local dirs in terminal
/data/1/yarn/nm/usercache//appcache/<application_xxxxxxxxxxx_xxxxxxx>,/data/10/yarn/nm/usercache//appcache/<application_xxxxxxxxxxx_xxxxxxx>,/data/11/yarn/nm/usercache//appcache/<application_xxxxxxxxxxx_xxxxxxx>,...
But yes, it will basically point to the parent dir (created by YARN) of UUID-randomized subdirs created by DiskBlockManager
, as @KoedIt mentioned:
:
23/01/05 10:15:37 INFO storage.DiskBlockManager: Created local directory at /data/1/yarn/nm/usercache/<your-user-id>/appcache/application_xxxxxxxxx_xxxxxxx/blockmgr-d4df4512-d18b-4dcf-8197-4dfe781b526a
: