apache-drill, dremel

How does Dremel, or an implementation of it such as Drill, handle a large columnar data layout in memory?


I am going through Google's Dremel white paper. I learned that it converts complex, nested data into a columnar data layout.

Where is this columnar data stored?

Since Drill has no central metadata repository, I assume the data must be held in memory.

If so, how does Drill handle this data when I have billions of rows?


Solution

  • To get complete, consistent query results from billions of rows, use a distributed file system connected to multiple Drillbits, simulate a distributed file system by copying the same files to every node, or use an NFS volume such as Amazon Elastic File System. Drill then queries big data performantly using a number of techniques, including these (the columnar, vectorized in-memory model is illustrated in the sketch after this list):

    • Relies on the cluster's nodes to handle failures (doesn't spend query time on failure-related bookkeeping).
    • Uses an in-memory data model that's hierarchical and columnar (doesn't access the disk for columns that are not involved in an analytic query, processing the columnar data without row materialization).
    • Uses columnar storage optimizations and execution (keeps memory footprint low).
    • Uses vectorization to work on arrays of values from different records rather than single values from one record at a time.

    For more information, see http://drill.apache.org/docs/performance/.
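
The last three points are the heart of the answer to "how does it handle billions of rows in memory": rows stream through the engine as fixed-size columnar batches rather than being materialized all at once. Below is a minimal Java sketch of that idea. It is not Drill's actual ValueVector code; the class and field names are hypothetical and only illustrate the layout and the vectorized loop.

```java
// A minimal sketch, NOT Drill's actual ValueVector/RecordBatch API: the class
// and field names here are hypothetical and only illustrate the layout.
import java.util.concurrent.ThreadLocalRandom;

public class ColumnarBatchSketch {

    // Each column of a batch is a contiguous array, so a query that only
    // aggregates "amount" never has to touch "userId" or "countryCode".
    static final class RecordBatch {
        final long[] userId;
        final double[] amount;
        final int[] countryCode;

        RecordBatch(int rowCount) {
            userId = new long[rowCount];
            amount = new double[rowCount];
            countryCode = new int[rowCount];
        }
    }

    // Vectorized aggregate: one tight loop over an array of values drawn from
    // many records, instead of pulling values out of one row object at a time.
    static double sumAmount(RecordBatch batch) {
        double sum = 0.0;
        for (double v : batch.amount) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Billions of rows are never materialized at once; they stream through
        // as fixed-size batches, keeping the per-node memory footprint bounded
        // regardless of the total data size.
        double total = 0.0;
        for (int b = 0; b < 1_000; b++) {                 // pretend each batch arrived from storage
            RecordBatch batch = new RecordBatch(65_536);
            for (int i = 0; i < batch.amount.length; i++) {
                batch.amount[i] = ThreadLocalRandom.current().nextDouble();
            }
            total += sumAmount(batch);                    // only the "amount" column is read
        }
        System.out.printf("total = %.2f%n", total);
    }
}
```

In Drill itself the columnar buffers are its ValueVectors and the batches flow through a pipeline of operators; the sketch just shows why that layout avoids reading unused columns and keeps the in-memory footprint small even when the underlying data set is huge.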