sql-server performance sql-server-2012 query-tuning

Case of using filtered statistics


I was reading about filtered statistics at the link below.

http://blogs.msdn.com/b/psssql/archive/2010/09/28/case-of-using-filtered-statistics.aspx

The data is heavily skewed: one region (Dallas) has a single Sales row, and all the remaining rows belong to the other region (New York). Below is the entire code to reproduce the issue:

create table Region(id int, name nvarchar(100)) 
go 
create table Sales(id int, detail int) 
go 
create clustered index d1 on Region(id) 
go 
create index ix_Region_name on Region(name) 
go 
create statistics ix_Region_id_name on Region(id, name) 
go 
create clustered index ix_Sales_id_detail on Sales(id, detail) 
go

-- only two values in this table as lookup or dim table 
insert Region values(0, 'Dallas') 
insert Region values(1, 'New York') 
go

set nocount on 
-- Sales is skewed 
insert Sales values(0, 0) 
declare @i int 
set @i = 1 
while @i <= 1000 begin 
insert Sales  values (1, @i) 
set @i = @i + 1 
end 
go

update statistics Region with fullscan 
update statistics Sales with fullscan 
go

set statistics profile on 
go 
-- note that this query will overestimate:
-- it estimates 500.5 rows, but only 1 row is returned
select detail from Region join Sales on Region.id = Sales.id where name='Dallas' option (recompile) 
-- this query will underestimate:
-- it also estimates 500.5 rows, but 1000 rows are returned
select detail from Region join Sales on Region.id = Sales.id where name='New York' option (recompile) 
go

set statistics profile off 
go

create statistics Region_stats_id on Region (id) 
where name = 'Dallas' 
go 
create statistics  Region_stats_id2 on Region (id) 
where name = 'New York' 
go

set statistics profile on 
go 
--now the estimate becomes accurate (1 row) because stats Region_stats_id is used to evaluate
select detail from Region join Sales on Region.id = Sales.id where name='Dallas' option (recompile)

--the estimate becomes accurate (1000 rows) because stats Region_stats_id2 is used to evaluate 
select detail from Region join Sales on Region.id = Sales.id where name='New York' option (recompile) 
go

set statistics profile off

My question: we have the following statistics available on both tables.

sp_helpstats 'region','all'
sp_helpstats 'sales','all'

Table region:

statistics_name   statistics_keys
d1                    id
ix_Region_id_name     id, name
ix_Region_name        name

Table sales:

statistics_name    statistics_keys
ix_Sales_id_detail     id, detail

1. Why did the estimation go wrong for the queries below?

select detail from Region join Sales on Region.id = Sales.id where name='Dallas' option (recompile)

--the estimate becomes accurate (1000 rows) because stats Region_stats_id2 is used to evaluate 
select detail from Region join Sales on Region.id = Sales.id where name='New York' option (recompile) 

2. When I created the filtered statistics as the author suggests, the estimates were correct. But why do we need filtered statistics here? How can I tell that my queries need filtered statistics, since even with simple (unfiltered) statistics I got the same result?

The best resources I have come across so far:
1. Kimberly Tripp's skewed stats video
2. TechNet statistics whitepaper

But I still cannot understand why filtered stats made a difference here.

Thanks in advance.

Update (7/4):

Rephrasing the question after Martin's and James's answers:

1. Is there any way to detect data skew other than Kimberly's script? One other way to estimate it is to count the number of rows for each value.

2. Have you faced any issues with data skew in your experience? I assume it mostly matters for large tables, but I am looking for a detailed answer.

3. We have to account for the I/O cost of SQL Server scanning the table, plus occasional blocking of a query that happens to run when a statistics update is triggered. Do you see any overhead other than this in maintaining statistics?

The reason I ask is that I am thinking of creating filtered statistics on several conditions, based partly on DTA input.

Thanks again.


Solution

  • I assume this is why it happens: you get the same estimate (500.5 rows) for both queries because SQL Server has no statistics that tell it which IDs belong to which region. The statistics ix_Region_id_name cover both columns, but since the histogram exists only for the first column (id), it doesn't help in estimating how many rows will match in the Sales table.

    If you run dbcc show_statistics ('Region','ix_Region_id_name'), the result will be:

    RANGE_HI_KEY   RANGE_ROWS   EQ_ROWS   DISTINCT_RANGE_ROWS   AVG_RANGE_ROWS
    0              0            1         0                     1
    1              0            1         0                     1
    

    This tells us that there is 1 row for each ID, but there is no link to the names.
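
    A rough check of where 500.5 could come from, assuming the optimizer falls back to the average number of Sales rows per distinct id when it cannot tell which id the name filter selects (a sketch against the tables from the question, not a statement of the optimizer's exact formula):

    ```sql
    -- Average Sales rows per distinct id: 1001 rows / 2 distinct ids
    select count(*) * 1.0 / count(distinct id) as avg_rows_per_id
    from Sales;

    -- The density vector stores 1 / (distinct ids) as "All density" for id;
    -- multiplying it by the table's row count gives the same average
    dbcc show_statistics ('Sales', 'ix_Sales_id_detail') with density_vector;
    ```

    With 1 row for id 0 and 1000 rows for id 1, the first query works out to 1001 / 2 = 500.5, which matches the estimate reported for both queries.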

    But once you create the statistics Region_stats_id (for Dallas), dbcc show_statistics ('Region','Region_stats_id') will show:

    RANGE_HI_KEY   RANGE_ROWS   EQ_ROWS   DISTINCT_RANGE_ROWS   AVG_RANGE_ROWS
    0              0            1         0                     1
    

    So SQL Server knows that there is only 1 row, and it's ID 0.

    Similarly Region_stats_id2:

    RANGE_HI_KEY   RANGE_ROWS   EQ_ROWS   DISTINCT_RANGE_ROWS   AVG_RANGE_ROWS
    1              0            1         0                     1
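
    To confirm which statistics on Region are filtered and what predicate each one carries, the sys.stats catalog view can be queried. A minimal sketch, assuming the objects were created exactly as in the question:

    ```sql
    -- has_filter = 1 marks a filtered statistics object;
    -- filter_definition shows its where clause
    select name, has_filter, filter_definition
    from sys.stats
    where object_id = object_id('Region');
    ```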
    

    And the row counts for Sales, recorded in ix_Sales_id_detail, help determine the number of rows per ID:

    RANGE_HI_KEY   RANGE_ROWS   EQ_ROWS   DISTINCT_RANGE_ROWS   AVG_RANGE_ROWS
    0              0            1         0                     1
    1              0            1000      0                     1
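
    The per-ID counts in that histogram can be cross-checked against the table itself. A minimal sketch, again assuming the tables from the question:

    ```sql
    -- Actual rows per id in Sales; compare with EQ_ROWS in the histogram
    select id, count(*) as row_count
    from Sales
    group by id
    order by id;
    ```

    Counting rows per value like this is also a simple way to spot skew on a column before deciding whether filtered statistics are worth creating.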
    

    Info: This is a copy of an answer that was deleted by @MartijnPieters, reposted here because this is the question I intended to answer and I can't seem to do anything with the deleted answer. I accidentally posted it first on TheGameiswar's other statistics question from today, but I have already deleted that one myself.