I execute a simple query:
SELECT * FROM TABLE1
WHERE ID > 9 AND ID < 11
and the query verbose plan is:
[SPU Sequential Scan table "TABLE1" {(TABLE1."ID")}]
-- Estimated Rows = 1, ...
But after changing the where clause to
WHERE ID = 10
the query verbose plan changes:
[SPU Sequential Scan table "TABLE1" {(TABLE1."ID")}]
-- Estimated Rows = 1000, ...
(where 1000 is the total number of rows in TABLE1).
Why is it so? How does the estimation work?
The optimizer of any cost-based database is always full of surprises, and this one is not unusual across the platforms I'm familiar with.
A couple of questions:
- Have you created statistics on the table? (Otherwise you are flying blind; see the sketch below.)
- What is the datatype of that column? (I hope it is an integer of some sort, not a NUMBER(x,y), even if y = 0.)
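If statistics turn out to be missing, a reasonable first step is simply to create them so the planner has something to work with. A minimal sketch (the exact options vary between NPS versions):

    -- (Re)build statistics for the whole table
    GENERATE STATISTICS ON TABLE1;

    -- Or only for the column used in the where clause
    GENERATE STATISTICS ON TABLE1 (ID);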
Furthermore: the statistics for a column in Netezza contain no distribution statistics (it won't know whether there are more "solved" than "unsolved" cases in a support-system table with five years' worth of data). Instead it relies on two things:
1) For all tables: simple statistics, if you create them (number of distinct values, min and max values, number of nulls).
2) For largish tables (I believe the configurable minimum is close to 100 million rows) it creates JIT (Just In Time) statistics by scanning a few random data pages on the data slices that satisfy the zone-mappable where clauses, and building statistics for that one query.
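To get a feel for what the simple statistics in point 1 capture for your table, you can compute the same figures by hand with plain SQL (using the column from your example):

    -- The figures the simple statistics are built from:
    -- distinct values, min/max and null count for the column.
    SELECT COUNT(*)           AS total_rows,
           COUNT(DISTINCT ID) AS distinct_ids,
           MIN(ID)            AS min_id,
           MAX(ID)            AS max_id,
           SUM(CASE WHEN ID IS NULL THEN 1 ELSE 0 END) AS null_ids
    FROM TABLE1;

If the column really has close to 1000 distinct values, an equality predicate would normally be estimated at roughly total_rows / distinct_ids, i.e. about 1 row, which is why the first thing to check is whether statistics exist at all.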
The last feature is actually quite powerful, even though it adds runtime to the planning phase of the query. It significantly increases the likelihood that if there is SOME correlation between two where clauses on a table, this will be taken into account. An example: a where clause of (AGE > 60 AND Retired = true) in a list of all citizens in a major city. It is most likely more or less irrelevant to add the AGE restriction, and Netezza will know that.
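As a concrete (and entirely hypothetical) version of that example, the table and column names are made up, but the shape of the predicate is what matters:

    -- Age and retirement status are strongly correlated, so the AGE
    -- predicate adds very little selectivity on top of "retired = true".
    SELECT COUNT(*)
    FROM citizens
    WHERE age > 60
      AND retired = true;

A planner that multiplies the two selectivities as if they were independent will underestimate the result badly; JIT sampling on the qualifying data slices lets Netezza see the correlation for this one query.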
In general you should not worry about the estimated number of rows being a bit off (as in this case) with Netezza; it will most often get it "right enough" and throw hardware at the problem to compensate for any minor mistakes.
Until recently I worked with SQL Server, which is notorious (it may be better in newer versions) for being overly optimistic about the value of where clauses, ending up with access plans containing 5 levels of nested-loop joins with millions of rows in each when joining 6 tables. Changing where clauses, much like you did in the question, will cause SQL Server to put LESS emphasis on a specific restriction, and that can cause the 5 joins to switch to a more efficient HASH or other algorithm, resulting in better performance. In my experience that is MUCH too frequent an occurrence on databases that rely TOO heavily on these estimates, perhaps because the optimizer was not created/tuned for a warehouse-like workload.
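For comparison, on SQL Server you can at least override that behaviour per query with a join hint (a T-SQL aside with made-up table names, not something you need on Netezza):

    -- Force hash joins instead of the nested loops the optimistic
    -- estimates would otherwise pick for this query.
    SELECT f.order_id, d.customer_name
    FROM fact_orders AS f
    JOIN dim_customer AS d
      ON d.customer_id = f.customer_id
    OPTION (HASH JOIN);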