Tags: apache-spark, apache-spark-standalone

Who loads partitions into RAM in Apache Spark?


I have a question whose answer I have not been able to find anywhere.

I am using the following lines to load data within a PySpark application:

loadFile = self.tableName + ".csv"
dfInput = self.sqlContext.read.format("com.databricks.spark.csv") \
    .option("header", "true") \
    .load(loadFile)

My cluster configuration is as follows:

  • I am using a Spark cluster with 3 nodes: 1 node is used to start the master, the other 2 nodes are running 1 worker each.
  • I submit the application from outside the cluster on a login node, with a script.
  • The script submits the Spark application in cluster deploy mode, which, I believe, means the driver runs on any of the 3 nodes I am using.
  • The input CSV files are stored in a globally visible temporary file system (Lustre).

In Apache Spark Standalone, how does the process of loading partitions into RAM work?

  1. Does each executor access the driver node's RAM and load partitions into its own RAM from there? (Storage --> driver's RAM --> executor's RAM)
  2. Does each executor access storage directly and load partitions into its own RAM? (Storage --> executor's RAM)

Or is it neither of these, and am I missing something? How can I observe this process myself (a monitoring tool, a Unix command, somewhere in Spark)?

Any comment or resource in which I can get deep into this would be very helpful. Thanks in advance.


Solution

  • The second scenario is correct:

    each executor accesses storage directly and loads partitions into its own RAM (Storage --> executor's RAM)

    The driver only plans the job and schedules tasks; the data itself never passes through the driver's memory. You can observe this in the Spark web UI while the application runs (port 4040 on the driver by default): the Executors tab shows input bytes read per executor, and each stage's task table shows which executor read which partition.
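The mechanics can be sketched in plain Python, with no Spark installation required. This is a simplified illustration, not Spark's actual code: the driver computes byte-range splits of the input file, and each executor then opens the globally visible file (e.g. on Lustre) itself and reads only its own range into local RAM. (Real Spark additionally aligns splits to record boundaries, which this sketch ignores.)

```python
import os
import tempfile

def compute_splits(file_size, num_splits):
    """Driver-side planning: divide the file into byte ranges.
    Only these small (start, end) tuples are shipped to executors,
    never the data itself."""
    chunk = file_size // num_splits
    splits = []
    for i in range(num_splits):
        start = i * chunk
        end = file_size if i == num_splits - 1 else start + chunk
        splits.append((start, end))
    return splits

def executor_read(path, start, end):
    """Executor-side work: open the shared file directly and read
    only this partition's byte range into the executor's own RAM."""
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - start)

# Demo: write a small "CSV" to a shared location, plan splits on the
# driver, then have each simulated executor fetch its own range.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"a,b\n1,2\n3,4\n5,6\n")
    path = tmp.name

size = os.path.getsize(path)
splits = compute_splits(size, 2)
parts = [executor_read(path, start, end) for start, end in splits]

# The ranges cover the whole file with no overlap.
assert b"".join(parts) == open(path, "rb").read()
os.unlink(path)
```

The point of the sketch is the data path: storage --> executor's RAM, with the driver holding only metadata (the split boundaries), which is why loading scales with the number of executors rather than being bottlenecked on the driver.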