I want to compare query performance between Spark and Drill, so the configuration of the two systems has to be identical. Which parameters do I have to consider, such as driver memory and executor memory for Spark, or max direct memory and planner.memory.max_query_memory_per_node for Drill? Can someone give me an example configuration?
It is possible to get a close comparison between Spark and Drill for a specific overlapping use case. I will first describe how Spark and Drill differ, what the overlapping use cases are, and finally how you can tune Spark's memory settings to match Drill as closely as possible for those use cases.
Both Spark and Drill can function as a SQL compute engine. My definition of a SQL compute engine is a system that can accept a SQL statement, distribute the work for that statement across a cluster of nodes, and return the results.
Drill is only a SQL compute engine, while Spark can do more than that. The extra things Spark can do include running general-purpose distributed programs (for example, jobs written in Scala or Python), stream processing, and machine learning.
So to accurately compare Drill and Spark you can only consider their overlapping functionality, which is executing a SQL statement.
A running Spark job is composed of two types of nodes: an Executor and a Driver. The Executor is like a worker node that is given simple compute tasks and executes them. The Driver orchestrates a Spark job. For example, if you have a SQL query or a Spark job written in Python, the Driver is responsible for planning how the work for the SQL query or Python script will be distributed to the Executors. The Driver then monitors the work being done by the Executors. The Driver can run in a variety of modes: on your laptop as a client, or on a separate dedicated node or container.
Drill is slightly different. The two participants in a SQL query are the Client and the Drillbits. The Client is essentially a dumb command-line terminal for sending SQL commands and receiving results. The Drillbits are responsible for doing the compute work for a query. When the Client sends a SQL command to Drill, it picks one Drillbit to be the Foreman. There is no restriction on which Drillbit can be the Foreman, and a different Foreman can be selected for each query. The Foreman performs two functions during the query: it plans the query and distributes the work to the other Drillbits, and it also executes a share of the work itself like any other Drillbit.
The functions of Spark's Driver and Executors are very similar to those of Drill's Foreman and Drillbits, but not quite the same. The main difference is that a Driver cannot simultaneously function as an Executor, while a Foreman also functions as a regular Drillbit.
When constructing clusters to compare Spark and Drill, I would give each system the same compute resources: a Drill cluster with N nodes, and a Spark cluster with N Executors plus a dedicated node for the Driver.
Spark and Drill both run on the JVM. Applications running on the JVM have access to two kinds of memory: on-heap memory and off-heap memory. On-heap memory is normal garbage-collected memory; for example, if you do new Object(), the object is allocated on the heap. Off-heap memory is not garbage collected and must be explicitly allocated and freed. When an application consumes a large amount of heap memory (16 GB or more), it can tax the JVM garbage collector. In such cases garbage collection can incur a significant compute overhead and, depending on the GC algorithm, computation can pause for several seconds while garbage collection is done. In contrast, off-heap memory is not subject to garbage collection and does not incur these performance penalties.
Spark stores everything on heap by default. It can be configured to store some data in off heap memory, but it is not clear to me when it will actually store data off heap.
Drill stores all its data in off heap memory, and only uses on heap memory for the general engine itself.
Another difference is that Spark reserves some of its memory to cache DataSets, while Drill does no caching of data in memory after a query is executed.
In order to compare Spark and Drill apples to apples, we have to configure them to use the same amount of on-heap and off-heap memory for executing a SQL query. In the following example we will walk through how to configure Drill and Spark to use 8 GB of on-heap memory and 8 GB of off-heap memory.
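One unit detail worth pinning down first: JVM-style sizes like "8G" are binary gigabytes (GiB), while Spark's spark.memory.offHeap.size is given in plain bytes. Nothing Spark- or Drill-specific is needed to get the matching byte count, just arithmetic:

```shell
# 8 GiB expressed in bytes; use this for byte-valued settings such as
# spark.memory.offHeap.size if you want to match a JVM-style "8G" exactly.
bytes=$((8 * 1024 * 1024 * 1024))
echo "$bytes"
```

This prints 8589934592, which is noticeably larger than the "8 billion bytes" you might write from memory.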
Set the following in your drill-env.sh file on each Drillbit:
export DRILL_HEAP="8G"
export DRILL_MAX_DIRECT_MEMORY="8G"
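For reference, these two variables map onto standard HotSpot JVM flags; Drill's launch scripts assemble the actual command line, but a simple substitution sketches what the Drillbit JVM ends up being started with:

```shell
# Same values as in drill-env.sh above.
DRILL_HEAP="8G"
DRILL_MAX_DIRECT_MEMORY="8G"

# DRILL_HEAP feeds the heap flags, DRILL_MAX_DIRECT_MEMORY the direct
# (off-heap) memory cap. These are standard HotSpot options.
echo "-Xms$DRILL_HEAP -Xmx$DRILL_HEAP -XX:MaxDirectMemorySize=$DRILL_MAX_DIRECT_MEMORY"
```

Knowing this mapping is handy when you verify with `ps` or the Drillbit logs that the settings actually took effect.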
Once these are configured, restart your Drillbits and try your query. Your query may still run out of memory because Drill's memory management is under active development. To give yourself an out, you can manually control Drill's memory usage for a query with the planner.width.max_per_node and planner.memory.max_query_memory_per_node options, which are set in your drill-override.conf. Note that you must change these options on all your nodes and restart your Drillbits for them to take effect. A more detailed explanation of these options can be found here.
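As a sketch of what that looks like (the option names are real Drill options, but the values below are only illustrative and should be tuned for your workload), the drill-override.conf entries could be:

```
drill.exec.options: {
    # Limit query parallelism per node (illustrative value).
    planner.width.max_per_node: 4,
    # Per-query memory budget per node, in bytes (4 GiB here, illustrative).
    planner.memory.max_query_memory_per_node: 4294967296
}
```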
Create a properties file myspark.conf and pass it to the spark submit command. The spark properties file should include the following config.
# 8gb of heap memory for executor
spark.executor.memory 8g
# 8gb of heap memory for driver
spark.driver.memory 8g
# Enable off heap memory and use 8 GiB of it. The size is in bytes;
# 8589934592 = 8 * 1024^3 matches Drill's binary "8G" setting exactly.
spark.memory.offHeap.enabled true
spark.memory.offHeap.size 8589934592
# Do not set aside memory for caching data frames
# Haven't tested if 0.0 works. If it doesn't make this
# as small as possible
spark.memory.storageFraction 0.0
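To apply the file, pass it to spark-submit with its --properties-file flag; the class and jar names below are placeholders for whatever benchmark job you run:

```shell
# Placeholders: substitute your actual benchmark class and application jar.
spark-submit \
  --properties-file myspark.conf \
  --class com.example.SqlBenchmark \
  sql-benchmark.jar
```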
Create a Drill cluster with N nodes and a Spark cluster with N Executors plus a dedicated Driver, apply the memory configurations provided above, and run the same (or a similar) SQL query on both clusters. Hope this helps.