I have the following scenario. A dimension table, e.g. PRODUCT is loaded into SQL Server 2016. A fact table, e.g. ORDER_ITEM is loaded into Hadoop. I want to run an aggregate query across PRODUCT and ORDER_ITEM, e.g.
SELECT
PRODUCT.PRODUCT_CATEGORY,
SUM(ORDER_ITEM.AMOUNT)
FROM
HADOOP.ORDER_ITEM OI
JOIN RDBMS.PRODUCT P ON (OI.PRODUCT_ID = P.PRODUCT_ID)
GROUP BY
PRODUCT.PRODUCT_CATEGORY
What is the behaviour. (1) Does Polybase broadcast the PRODUCT dimension into Hadoop performs the join and aggregation there and returns the result (2) Does Polybase broadcast the ORDER_ITEM table to SQL Server and perfroms the join and aggregation there?
It's probably (2), but if someone has tried it out let me know
PolyBase never moves data from the SQL Server regardless of the data volume. Depending on the statistics, PolyBase would either:
A) Stream order_item table back to SQL Server and compute the join and aggregate the data.
B) Push down a partial aggregate Sum(Order_Item.Amount) group by OI.ProductiD, stream the result set to SQL Server, then do the join and final aggregation within SQL Server.