
What is the equivalent configuration between Spark and Drill?


I want to compare query performance between Spark and Drill, so the configuration of the two systems has to be identical. What parameters do I have to consider, such as driver memory and executor memory for Spark, or max direct memory and planner.memory.max_query_memory_per_node for Drill? Can someone give me an example configuration?


Solution

  • It is possible to get a close comparison between Spark and Drill for a specific overlapping use case. I will first describe how Spark and Drill differ, what the overlapping use cases are, and finally how you could tune Spark's memory settings to match Drill as closely as possible for those use cases.

    Comparison of Functionality

    Both Spark and Drill can function as a SQL compute engine. My definition of a SQL compute engine is a system that can do the following:

    • Ingest data from files, databases, or message queues.
    • Execute SQL statements provided by a user on the ingested data.
    • Write the results of a user's SQL statement to a terminal, file, database table, or message queue.
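The three steps above can be made concrete with a small sketch. This uses Python's built-in sqlite3 purely as a stand-in SQL engine to illustrate ingest, execute, and write; it is not Spark or Drill code:

```python
import sqlite3

# 1. Ingest data (an in-memory list stands in for a file or queue here).
rows = [("alice", 30), ("bob", 25), ("carol", 35)]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?)", rows)

# 2. Execute a user-supplied SQL statement on the ingested data.
result = conn.execute(
    "SELECT name FROM people WHERE age > 28 ORDER BY name"
).fetchall()

# 3. Write the results out (to the terminal in this sketch).
for (name,) in result:
    print(name)  # prints alice, then carol
```

Both Spark and Drill perform these same three steps, just distributed across many nodes.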

    Drill is only a SQL compute engine, while Spark is more than just a SQL compute engine. The extra things that Spark can do are the following:

    • Spark has APIs for manipulating data with functional programming operations, not just SQL.
    • Spark can save the results of operations as Datasets, which can be efficiently reused in other operations and efficiently cached both on disk and in memory.
    • Spark has stream processing APIs.

    So to compare Drill and Spark accurately, you can only consider their overlapping functionality: executing a SQL statement.

    Comparison of Nodes

    A running Spark job is composed of two types of nodes: Executors and a Driver. An Executor is a worker node that is given simple compute tasks and executes them. The Driver orchestrates a Spark job. For example, if you have a SQL query or a Spark job written in Python, the Driver is responsible for planning how the work for the SQL query or Python script will be distributed to the Executors, and it then monitors the work being done by them. The Driver can run in a variety of modes: on your laptop like a client, or on a dedicated node or container.

    Drill is slightly different. The two participants in a SQL query are the Client and the Drillbits. The Client is essentially a dumb command-line terminal for sending SQL commands and receiving results. The Drillbits are responsible for doing the compute work for a query. When the Client sends a SQL command to Drill, it picks one Drillbit to be the Foreman. There is no restriction on which Drillbit can be the Foreman, and a different Foreman can be selected for each query. The Foreman performs two functions during the query:

    1. It plans the query and orchestrates the rest of the Drillbits to divide up the work.
    2. It also participates in executing the query and does some of the data processing itself.

    The functions of Spark's Driver and Executors are very similar to those of Drill's Foreman and Drillbits, but not quite the same. The main difference is that a Driver cannot simultaneously function as an Executor, while a Foreman also functions as a regular Drillbit.

    When constructing clusters to compare Spark and Drill, I would do the following:

    • Drill: Create a cluster with N nodes.
    • Spark: Create a cluster with N Executors and make sure the Driver has the same amount of memory as the Executors.

    Comparison of Memory Models

    Spark and Drill both run on the JVM. Applications running on the JVM have access to two kinds of memory: on-heap memory and off-heap memory. On-heap memory is normal garbage-collected memory; for example, if you do new Object(), the object will be allocated on the heap. Off-heap memory is not garbage collected and must be explicitly allocated and freed. When an application consumes a large amount of heap memory (16 GB or more), it can tax the JVM garbage collector. In such cases garbage collection can incur significant compute overhead and, depending on the GC algorithm, computation can pause for several seconds while garbage collection runs. In contrast, off-heap memory is not subject to garbage collection and does not incur these performance penalties.

    Spark stores everything on heap by default. It can be configured to store some data in off heap memory, but it is not clear to me when it will actually store data off heap.

    Drill stores all of its data in off-heap memory and uses on-heap memory only for the general engine itself.

    Another difference is that Spark reserves some of its memory to cache Datasets, while Drill does not cache data in memory after a query is executed.

    To compare Spark and Drill apples to apples, we have to configure them to use the same amounts of on-heap and off-heap memory for executing a SQL query. The following example walks through configuring Drill and Spark to use 8 GB of on-heap memory and 8 GB of off-heap memory.
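With those numbers, the per-node totals for the two systems can be tallied as a quick sanity check (this sketch assumes binary gigabytes, i.e. 8G means 8 GiB):

```python
GIB = 1024 ** 3  # one gibibyte in bytes

# Drill per node: DRILL_HEAP (on heap) + DRILL_MAX_DIRECT_MEMORY (off heap)
drill_per_node = 8 * GIB + 8 * GIB

# Spark per Executor: spark.executor.memory (on heap)
#                     + spark.memory.offHeap.size (off heap)
spark_per_executor = 8 * GIB + 8 * GIB

assert drill_per_node == spark_per_executor
print(spark_per_executor)  # 17179869184
```

If the two totals differ, the comparison is no longer apples to apples, because one engine simply has more memory to work with.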

    Drill Memory Config Example

    Set the following in the drill-env.sh file on each Drillbit:

    export DRILL_HEAP="8G"  
    export DRILL_MAX_DIRECT_MEMORY="8G"
    

    Once these are configured, restart your Drillbits and try your query. Your query may still run out of memory because Drill's memory management is under active development. As an escape hatch, you can manually limit the memory a query uses with the planner.width.max_per_node and planner.memory.max_query_memory_per_node options. These options are set in your drill-override.conf. Note that you must change these options on all of your nodes and restart your Drillbits for them to take effect. A more detailed explanation of these options can be found here.
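As a sketch, a drill-override.conf fragment along these lines sets both options. The values are illustrative only, and the exact key syntax should be checked against your Drill version's documentation:

```
drill.exec.options: {
  # Maximum parallel fragments per node (illustrative value)
  "planner.width.max_per_node": 4,
  # Per-query memory cap per node, in bytes (4 GiB here, illustrative)
  "planner.memory.max_query_memory_per_node": 4294967296
}
```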

    Spark Memory Config Example

    Create a properties file myspark.conf and pass it to the spark-submit command. The properties file should include the following configuration.

    # 8gb of heap memory for executor
    spark.executor.memory            8g
    # 8gb of heap memory for driver
    spark.driver.memory              8g
    # Enable off heap memory and use 8 GiB of it
    # (8589934592 bytes, matching DRILL_MAX_DIRECT_MEMORY="8G")
    spark.memory.offHeap.enabled     true
    spark.memory.offHeap.size        8589934592
    # Do not set aside storage memory for caching data frames.
    # Haven't tested whether 0.0 works; if it doesn't, make this
    # as small as possible.
    spark.memory.storageFraction     0.0
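One detail worth calling out: the heap settings accept a JVM-style size suffix (8g), while spark.memory.offHeap.size is given above as a raw byte count. A small sketch (the to_bytes helper is hypothetical, not part of any Spark API) makes the unit conversion explicit:

```python
def to_bytes(size):
    """Convert a JVM-style size string like '8g' to bytes.

    Hypothetical helper for illustration only; it is not a Spark API.
    """
    units = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3, "t": 1024 ** 4}
    s = size.strip().lower()
    if s and s[-1] in units:
        return int(s[:-1]) * units[s[-1]]
    return int(s)

# spark.executor.memory and spark.driver.memory above are "8g":
print(to_bytes("8g"))  # 8589934592 bytes, i.e. 8 GiB
```

If the byte count you put in spark.memory.offHeap.size does not work out to the same figure, the Spark Executors and the Drillbits will not actually have the same amount of off-heap memory.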
    

    Summary

    Create a Drill cluster with N nodes and a Spark cluster with N Executors plus a dedicated Driver, try the memory configurations provided above, and run the same or a similar SQL query on both clusters. Hope this helps.