I have Spark set up in standalone mode on a single node with 2 cores and 16GB of RAM, to run some rough POCs. I want to load data from a SQL source using val df = spark.read.format("jdbc")...option("numPartitions", n).load(). When I tried to measure the time taken to read a table for different numPartitions values by calling df.rdd.count, I saw that the time was the same regardless of the value I gave. I also noticed on the Spark context web UI that the number of active executors was 1, even though I set SPARK_WORKER_INSTANCES=2 and SPARK_WORKER_CORES=1 in my spark-env.sh file.
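Roughly, the read and the timing I am doing look like this (the connection URL, table name and credentials are placeholders, not my actual values):

val n = 4 // one of the numPartitions values I tried
val df = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/mydb") // placeholder connection string
  .option("dbtable", "my_table")                          // placeholder table name
  .option("user", "user")
  .option("password", "password")
  .option("numPartitions", n)
  .load()

val start = System.nanoTime()
df.rdd.count()
println(s"count took ${(System.nanoTime() - start) / 1e9} s")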
I have 2 questions:
Do the numPartitions actually created depend on the number of executors?
How do I start spark-shell with multiple executors in my current setup?
Thanks!
The number of partitions doesn't depend on your number of executors. There is a best practice of sizing the number of partitions relative to the number of cores, but the partition count is not determined by the number of executor instances.
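You can check how many partitions a DataFrame actually ended up with:

df.rdd.getNumPartitions // current number of partitions of the underlying RDD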
When reading over JDBC, to parallelize the read you also need to specify a partition column and its bounds, e.g.:
spark.read.format("jdbc")
  .option("url", url)
  .option("dbtable", "table")
  .option("user", user)
  .option("password", password)
  .option("numPartitions", numPartitions)
  .option("partitionColumn", "<partition_column>") // numeric column used to split the read
  .option("lowerBound", 1)
  .option("upperBound", 10000)
  .load()
That will split the read into numPartitions parallel queries, each covering roughly 10,000 / numPartitions values of the partition column.
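To make the splitting concrete, here is a rough sketch of how the bounds are turned into per-partition WHERE predicates (Spark's real logic keeps the first and last ranges open-ended, so this is only an approximation):

// sketch: split the (lowerBound, upperBound) range into numPartitions sub-ranges
val lowerBound = 1L
val upperBound = 10000L
val numPartitions = 4
val stride = (upperBound - lowerBound) / numPartitions
val predicates = (0 until numPartitions).map { i =>
  val lo = lowerBound + i * stride
  // the real implementation leaves the first and last ranges unbounded
  val hi = if (i == numPartitions - 1) upperBound + 1 else lo + stride
  s"<partition_column> >= $lo AND <partition_column> < $hi"
}
predicates.foreach(println) // one WHERE clause per parallel query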
As for your second question, all of the relevant settings are documented in the Spark configuration docs: https://spark.apache.org/docs/latest/configuration.html (e.g. spark-shell --num-executors, or the configuration --conf spark.executor.instances).
Be aware that explicitly specifying the number of executors means dynamic allocation will be turned off.
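For example, launching the shell against a standalone master could look something like this (the master URL is a placeholder, and this simply follows the settings mentioned above):

spark-shell --master spark://localhost:7077 \
  --conf spark.executor.instances=2 \
  --conf spark.executor.cores=1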