Tags: azure, apache-spark, apache-spark-sql, azure-databricks

How does Spark processing work on data from outside the cluster, like Azure Blob Storage?


My question is similar to:

Standalone Spark cluster on Mesos accessing HDFS data in a different Hadoop cluster

While the question above is about using Spark to process data from a different Hadoop cluster, I would also like to know how Spark processes data from an Azure Blob Storage container.

From the Azure documentation (https://learn.microsoft.com/en-us/azure/databricks/data/data-sources/azure/azure-storage), the following code is used to load the data directly into a DataFrame:

val df = spark.read.parquet("wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/<directory-name>")
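
Authentication to the storage account is configured separately; as a rough sketch, with placeholder names and assuming account-key auth (SAS tokens or a service principal are alternatives), it amounts to something like:

// Placeholder account name and key; set before the read above.
spark.conf.set(
  "fs.azure.account.key.<storage-account-name>.blob.core.windows.net",
  "<storage-account-access-key>")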

Is the complete data transferred to the driver memory and then split across executors when actions such as UDFs are applied on the DataFrame?

Does locality play a role in how this is processed? For example, if the Spark cluster and the data (either in an Azure Blob Storage container or a different Hadoop cluster) are located in different datacenters, how is it processed?


Solution

  • Is the complete data transferred to the driver memory and then split across executors when actions such as UDFs are applied on the DataFrame?

    Yes, the complete data is transferred, but not to the driver. The executors read the data in parallel. If there are many files, they are divided among the executors, and large files are read in parallel by multiple executors (if the file format is splittable); see the sketch at the end of this answer.

    val df = spark.read.parquet("wasbs://@.blob.core.windows.net/")

    It's critical to understand that this line of code doesn't load anything. The read is lazy: only later, when you call an action such as df.write or evaluate a Spark SQL query, is the data actually read. And if the data is partitioned, the query may be able to eliminate whole partitions that aren't needed.
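
    A rough sketch of both points (the path and the date partition column are hypothetical, and spark is the ambient SparkSession as in a notebook or spark-shell):

    import org.apache.spark.sql.functions.col

    // No table data is read here: Spark only lists the files and captures the schema.
    val df = spark.read.parquet(
      "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/<directory-name>")

    // The scan is planned into input partitions; each becomes a read task that runs
    // on an executor, so the driver never holds the full dataset.
    println(df.rdd.getNumPartitions)

    // If the directory layout is partitioned (e.g. .../date=2021-01-01/...), a filter
    // on the partition column lets Spark skip whole directories; the physical plan
    // shows this under PartitionFilters.
    df.filter(col("date") === "2021-01-01").explain()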

    Does locality play a role in how this is processed?

    In Azure, very fast networking compensates for having data and compute separated.

    Of course, you generally want the Blob Storage / Data Lake account in the same Azure region as the Spark cluster. Data movement across regions is slower, and is charged as data egress at a bit under $0.01/GB.
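
    To put that in perspective, scanning, say, 1 TB from a storage account in another region would add on the order of $10 in egress charges at that rate, on top of the slower cross-region reads.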