Tags: apache-spark, partitioning, external-data-source

How is data read in parallel in Spark from an external data source?


I am new to Spark and working through the Learning Spark book. I have some questions about how data fetching/reading works.

Suppose I have an external data source (not partitioned) that I want to process in Spark.

  1. Based on the example below, we create 10 strides/partitions. Will each of the partitions read the entire external data source and then keep only its filtered slice of the data?

[Image: "Partitioning Properties" — the partitioned JDBC read example from the book]
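Since the example image didn't survive, here is a minimal sketch of the kind of partitioned JDBC read the book describes. The connection URL, table name, and column name are hypothetical placeholders; the four partitioning options are the real Spark JDBC options that drive the behavior asked about:

```scala
import org.apache.spark.sql.SparkSession

object PartitionedJdbcRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("PartitionedJdbcRead")
      .getOrCreate()

    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/mydb") // hypothetical URL
      .option("dbtable", "orders")                          // hypothetical table
      .option("user", "username")
      .option("password", "password")
      // The four options that enable parallel reads:
      .option("partitionColumn", "order_id") // numeric, date, or timestamp column
      .option("lowerBound", "1000")
      .option("upperBound", "11000")
      .option("numPartitions", "10")
      .load()

    // stride = (upperBound - lowerBound) / numPartitions = 1000, so Spark
    // generates one bounded query per partition, roughly:
    //   SELECT ... WHERE order_id >= 1000 AND order_id < 2000
    //   SELECT ... WHERE order_id >= 2000 AND order_id < 3000
    //   ...
    // (the first and last partitions are left open-ended so rows outside
    // [lowerBound, upperBound] are not dropped)
    println(s"partitions: ${df.rdd.getNumPartitions}")

    spark.stop()
  }
}
```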

So will the first partition scan the entire data source and keep the rows BETWEEN 1000 AND 2000, and then the second partition scan the entire data source again and keep the rows BETWEEN 2000 AND 3000?

  2. Also, will there be 10 separate Spark sessions to handle them in parallel? If not, how will a single session read them in parallel?

  3. And will each partition be stored on a separate executor?

I tried searching online but could not find a satisfactory explanation.


Solution

    1. Each partition asks the external data source to filter the data at the source and send back only the rows matching its condition; how the source implements that is not Spark's concern. In your case, since the source is not partitioned, it will have to scan all of its data to find the required rows for each query.
    2. There will be a single Spark session. Multiple threads (one task per partition) open connections to your external source, each running its own bounded query, and the results become the partitions of one RDD/DataFrame (see the sketch below).
    3. Maybe, maybe not. Placement is agnostic: each partition's task is scheduled on whichever executor has available CPUs, so partitions may or may not land on separate executors.
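To make points 2 and 3 concrete, here is a minimal, self-contained sketch that logs which task thread processes each partition. It uses `spark.range` as a stand-in for the 10 JDBC strides so it runs without a database; which executor each partition lands on is whatever the scheduler happens to pick:

```scala
import org.apache.spark.TaskContext
import org.apache.spark.sql.SparkSession

object WhoReadsWhichPartition {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("WhoReadsWhichPartition")
      .getOrCreate()

    // Stand-in for the partitioned JDBC DataFrame: 10 partitions here
    // play the role of the 10 JDBC strides.
    val df = spark.range(0, 10000, 1, numPartitions = 10).toDF("id")

    df.rdd.foreachPartition { rows =>
      val ctx = TaskContext.get()
      // One task per partition: a thread inside some executor, all owned
      // by the single SparkSession that launched the job.
      println(s"partition=${ctx.partitionId()} " +
        s"thread=${Thread.currentThread().getName} rows=${rows.size}")
    }

    spark.stop()
  }
}
```

Running this shows all 10 partitions processed concurrently by different task threads under one session, which is exactly how the 10 JDBC queries are issued in parallel.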