Search code examples
performancedistributionnetezzasql-execution-plan

Netezza/PureData - Bad distribution key chosen in HASH JOIN


I am using Netezza/Pure Data for a query. I have a INNER JOIN (which became a HASH JOIN) on two columns A and B. A is a column that has good distribution and B is a column that has bad distribution. For some reason, my query plan always uses B instead A as the distribution key for that JOIN, which causes immense performance issue.

GENERATE STATISTICS does help alleviate this issue, but due to performance constraints, it is not feasiable to GENERATE STATISTICS before every query. I do it before a batch run but not in between each query within a batch.

In a nutshell, the source tables have good distributions but when I join them, they choose a bad distribution key (which is actually never used as a distribution column at all in the sources).

So my question is, what are some good ways to influence the choice of distribution key in a JOIN without doing GENERATE STATISTICS. I've tried changing around the distribution columns of the source tables but that didn't do much even if I make sure all the skew's are less than 0.5.


Solution

  • The workaround is to force exhaustive planner to be used.

    set num_star_planner_rels = X; -- Set X to very high.

    According to IBM Netezza team, queries with more than 7 entities (# of tables) will use a greedy query planner called "Snowflake". At 7 or less entities, it will use the brute force approach to find the best plan.

    The trade off is that exhaustive search is very expensive for large number of entities.