I'm currently confronted with a strange behaviour in my database when I'm querying a minimum ID for a specific date in a table contains about a hundred million rows. The query is quite simple :
SELECT MIN(Id) FROM Connection WITH(NOLOCK) WHERE DateConnection = '2012-06-26'
This query nevers end, at least I let it run for hours. The DateConnection column is not an index neither included in one. So I would understand that this query can last quite a bit. But I tried the following query which runs in few seconds :
SELECT Id FROM Connection WITH(NOLOCK) WHERE DateConnection = '2012-06-26'
It returns 300k rows.
My table is defined as this :
CREATE TABLE [dbo].[Connection](
[Id] [bigint] IDENTITY(1,1) NOT NULL,
[DateConnection] [datetime] NOT NULL,
[TimeConnection] [time](7) NOT NULL,
[Hour] AS (datepart(hour,[TimeConnection])) PERSISTED NOT NULL,
CONSTRAINT [PK_Connection] PRIMARY KEY CLUSTERED
(
[Hour] ASC,
[Id] ASC
)
)
And it has the following index :
CREATE UNIQUE NONCLUSTERED INDEX [IX_Connection_Id] ON [dbo].[Connection]
(
[Id] ASC
)ON [PRIMARY]
One solutions I find using this strange behaviour is using the following code. But it seems to me quite a bit heavy for such a simple query.
create table #TempId
(
[Id] bigint
)
go
insert into #TempId
select id from partitionned_connection with(nolock) where dateconnection = '2012-06-26'
declare @displayId bigint
select @displayId = min(Id) from #CoIdTest
print @displayId
go
drop table #TempId
go
Has anybody been confronted to this behaviour and what is the cause of it ? Is the minimum aggregate scanning the entire table ? And if this is the case why the simple select does not ?
The root cause of the problem is the non-aligned nonclustered index, combined with the statistical limitation Martin Smith points out (see his answer to another question for details).
Your table is partitioned on [Hour]
along these lines:
CREATE PARTITION FUNCTION PF (integer)
AS RANGE RIGHT
FOR VALUES (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23);
CREATE PARTITION SCHEME PS
AS PARTITION PF ALL TO ([PRIMARY]);
-- Partitioned
CREATE TABLE dbo.Connection
(
Id bigint IDENTITY(1,1) NOT NULL,
DateConnection datetime NOT NULL,
TimeConnection time(7) NOT NULL,
[Hour] AS (DATEPART(HOUR, TimeConnection)) PERSISTED NOT NULL,
CONSTRAINT [PK_Connection]
PRIMARY KEY CLUSTERED
(
[Hour] ASC,
[Id] ASC
)
ON PS ([Hour])
);
-- Not partitioned
CREATE UNIQUE NONCLUSTERED INDEX [IX_Connection_Id]
ON dbo.Connection
(
Id ASC
)ON [PRIMARY];
-- Pretend there are lots of rows
UPDATE STATISTICS dbo.Connection WITH ROWCOUNT = 200000000, PAGECOUNT = 4000000;
The query and execution plan are:
SELECT
MinID = MIN(c.Id)
FROM dbo.Connection AS c WITH (READUNCOMMITTED)
WHERE
c.DateConnection = '2012-06-26';
The optimizer takes advantage of the index (ordered on Id
) to transform the MIN
aggregate to a TOP (1)
- since the minimum value will by definition be the first value encountered in the ordered stream. (If the nonclustered index were also partitioned, the optimizer would not choose this strategy since the required ordering would be lost).
The slight complication is that we also need to apply the predicate in the WHERE
clause, which requires a lookup to the base table to fetch the DateConnection
value. The statistical limitation Martin mentions explains why the optimizer estimates it will only need to check 119 rows from the ordered index before finding one with a DateConnection
value that will match the WHERE clause
. The hidden correlation between DateConnection
and Id
values means this estimate is a very long way off.
In case you are interested, the Compute Scalar calculates which partition to perform the Key Lookup into. For each row from the nonclustered index, it computes an expression like [PtnId1000] = Scalar Operator(RangePartitionNew([dbo].[Connection].[Hour] as [c].[Hour],(1),(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),(16),(17),(18),(19),(20),(21),(22),(23)))
, and this is used as the leading key of the lookup seek. There is prefetching (read-ahead) on the nested loops join, but this needs to be an ordered prefetch to preserve the sorting required by the TOP (1)
optimization.
We can avoid the statistical limitation (without using query hints) by finding the minimum Id
for each Hour
value, and then taking the minimum of the per-hour minimums:
-- Global minimum
SELECT
MinID = MIN(PerHour.MinId)
FROM
(
-- Local minimums (for each distinct hour value)
SELECT
MinID = MIN(c.Id)
FROM dbo.Connection AS c WITH(READUNCOMMITTED)
WHERE
c.DateConnection = '2012-06-26'
GROUP BY
c.[Hour]
) AS PerHour;
The execution plan is:
If parallelism is enabled, you will see a plan more like the following, which uses parallel index scan and multi-threaded stream aggregates to produce the result even faster: