Search code examples
polybase

Pushdown in Polybase


I have the following scenario. A dimension table, e.g. PRODUCT is loaded into SQL Server 2016. A fact table, e.g. ORDER_ITEM is loaded into Hadoop. I want to run an aggregate query across PRODUCT and ORDER_ITEM, e.g.

SELECT
 PRODUCT.PRODUCT_CATEGORY,
 SUM(ORDER_ITEM.AMOUNT)
FROM
 HADOOP.ORDER_ITEM OI
 JOIN RDBMS.PRODUCT P ON (OI.PRODUCT_ID = P.PRODUCT_ID)
GROUP BY
  PRODUCT.PRODUCT_CATEGORY

What is the behaviour. (1) Does Polybase broadcast the PRODUCT dimension into Hadoop performs the join and aggregation there and returns the result (2) Does Polybase broadcast the ORDER_ITEM table to SQL Server and perfroms the join and aggregation there?

It's probably (2), but if someone has tried it out let me know


Solution

  • PolyBase never moves data from the SQL Server regardless of the data volume. Depending on the statistics, PolyBase would either:

    A) Stream order_item table back to SQL Server and compute the join and aggregate the data.

    B) Push down a partial aggregate Sum(Order_Item.Amount) group by OI.ProductiD, stream the result set to SQL Server, then do the join and final aggregation within SQL Server.