Tags: distributed, apache-drill

How to configure Drill to use all the nodes for a query (by creating multiple fragments)


I am using Drill (1.3) on two nodes. Say:

  1. 192.xxx.xxx.xxx
  2. 192.yyy.yyy.yyy

I tried a count query (from 192.xxx.xxx.xxx) on a CSV file of about a billion (1000 million) records:

select count(*) from dfs.`home/impadmin/BiggerBoy.csv`

I also tried a join query (from 192.xxx.xxx.xxx) across Hive and Oracle:

select * from hive.testDB.`catalog_sales` x inner join oracle.ILABUSER.`customer_address` y on y.CA_ADDRESS_SK = x.CS_BILL_ADDR_SK group by y.CA_CITY limit 100

Every time I got(from Drill UI):

Query Profile
STATE: COMPLETED

FOREMAN: 192.xxx.xxx.xxx

TOTAL FRAGMENTS: 1

Why is the other node not used? What, then, is the benefit of using multiple nodes in this case?

Does Drill take care of this by itself, or do I need to configure something?

If anybody has been able to get multiple fragments, please share your use case.


Solution

  • First, make sure you're using a distributed file system: as I understand from this post, a local file system plugin (dfs pointing at local disk) doesn't work with multiple Drillbits. Although the referenced post primarily addresses a question about writes, it sounds applicable to your question about reads as well.

    To configure Drill to use multiple nodes, see the subsection under Installing Drill in Distributed Mode.
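    For reference, here is a minimal drill-override.conf sketch for running the two nodes as one cluster; the cluster-id and the ZooKeeper address are placeholders, and both must be identical on every node:

    ```
    drill.exec: {
      cluster-id: "drillcluster",
      zk.connect: "192.xxx.xxx.xxx:2181"
    }
    ```

    After restarting both Drillbits, each one should register itself with the same ZooKeeper ensemble; only then can a query be distributed between them.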

    Query distribution depends on query complexity. When the planner builds the query plan, it splits the plan into multiple major fragments, and there is generally some distribution among them. Within a single node you can have multiple minor fragments running in parallel; e.g., on a 32-core machine Drill can run up to 23 minor fragments per major fragment, which is roughly 70% of the cores. Across multiple nodes, say 4, each node might run 23 minor fragments for the same query.
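    The parallelism limits mentioned above are tunable system options. A hedged sketch (these option names are real Drill options; the values shown are just illustrations, not recommendations):

    ```sql
    -- Maximum minor fragments per major fragment on one node
    -- (the default is derived from the node's core count, about 70%)
    ALTER SYSTEM SET `planner.width.max_per_node` = 23;

    -- Minimum estimated number of records a fragment must process
    -- before the planner parallelizes it at all (default 100000)
    ALTER SYSTEM SET `planner.slice_target` = 100000;
    ```

    Note that a query over a small input may stay at a single fragment simply because its estimated row count falls under planner.slice_target.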

    If the plan has a single root fragment, it runs on the foreman node and Drill cannot split it. Distribution of leaf fragments depends on the query and is limited by the number of splittable inputs: a single non-splittable file yields a single leaf fragment. Intermediate fragments in the plan can be distributed. I could not find details on exactly when the distribution of a single leaf or intermediate fragment ends up limited to one node.
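    One practical consequence: a single large CSV read through the dfs plugin gives the planner one scan unit, while a directory of smaller CSVs gives it one unit per file. A sketch of pre-splitting such a file (GNU coreutils `split` is assumed; the paths and row count are illustrative stand-ins, not your real BiggerBoy.csv):

    ```shell
    # Create a stand-in CSV (100000 rows) in place of the real file.
    mkdir -p /tmp/biggerboy
    seq 1 100000 | awk '{print $1 ",row" $1}' > /tmp/BiggerBoy.csv

    # Split it into 4 line-aligned pieces, keeping the .csv extension
    # so Drill's dfs plugin still recognizes the format.
    split -n l/4 --additional-suffix=.csv /tmp/BiggerBoy.csv /tmp/biggerboy/part_

    ls /tmp/biggerboy
    ```

    You can then query the directory as a whole, e.g. select count(*) from dfs.`/tmp/biggerboy`, and the planner can assign a separate leaf fragment to each file.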

    In the query profile, when you click the root fragment you see only a single minor fragment, and its host name is the same as the foreman's. If the profile shows multiple major fragments and you click one of them, you see the different host names across which the query has been distributed.
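    To confirm which nodes are actually part of the cluster, you can also list the registered Drillbits via the built-in sys.drillbits system table:

    ```sql
    SELECT * FROM sys.drillbits;
    ```

    If only one host shows up here, the second node never joined the cluster, and no query can be distributed to it regardless of the plan.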