I have a SQL Server database with EventJournal table with the following columns:
Ordering (bigint, primary key)
PersistenceID (nvarchar(255))
SequenceNr (bigint)
Payload (varbinary(max))
Other columns are omitted for clarity. In addition to the primary key on Ordering there is a unique constraint on PersistenceID+SequenceNr.
If I run a query
select top 100 * from EventJournal where PersistenceID like 'msc:%'
... it takes very long time to execute (the table contains more than 100M rows)
But if I add ordering to results:
select top 100 * from EventJournal where PersistenceID like 'msc:%' order by Ordering
... then it returns the result immediately.
The execution plan for both queries are the same and in essence is the clustered index scan on PK. Then why does the first query take long time to execute?
Here's the table definition:
CREATE TABLE [dbo].[EventJournal](
[PersistenceID] [nvarchar](255) NOT NULL,
[SequenceNr] [bigint] NOT NULL,
[IsDeleted] [bit] NOT NULL,
[Manifest] [nvarchar](500) NOT NULL,
[Payload] [varbinary](max) NOT NULL,
[Timestamp] [bigint] NOT NULL,
[Tags] [nvarchar](100) NULL,
[Ordering] [bigint] IDENTITY(1,1) NOT NULL,
[SerializerId] [int] NULL,
CONSTRAINT [PK_EventJournal] PRIMARY KEY CLUSTERED
(
[Ordering] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY],
CONSTRAINT [QU_EventJournal] UNIQUE NONCLUSTERED
(
[PersistenceID] ASC,
[SequenceNr] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
This it 1st plan:
https://www.brentozar.com/pastetheplan/?id=SJ3kCo-Fv
And here's the 2nd one:
As I mentioned in my comments, the plans are different, the difference is in access method:
The first plan uses unordered scan:
And the second plan uses ordered scan:
By the way, the other answer suggests useless index
.
SQL Server will NOT use this index
, as it's equivalent to the non-clustered index
already in place. As the index QU_EventJournal
on ([PersistenceID], [SequenceNr])
was not used, the same way the index on (PersistenceID, Ordering)
will not be used.
Both of these indexes has PersistenceID
, Ordering
in the index as Ordering
is clustered index key
, so it is presented in index on ([PersistenceID], [SequenceNr])
even if you don't see it in the definition. The suggested index will be also bigger as it is not defined as unique
, and the sizes of other fields are the same: Ordering
is bigint
, SequenceNr
is bigint
.
It's wrong to think that in index on 2 fields the second field(Ordering) can be used to avoid the sort in the second query, it's not true.
For example the index on PersistenceID
, Ordering
can have rows like these:
I hope you clearly see that the index is ordered by PersistenceID
, Ordering
,
but the result of the second query is expected to be
So the SORT
operator is needed, so this index will not be used.
Now to your question:
shouldn't lack of ORDER BY be used by the query analyzer as an opportunity to build more efficient execution plan without ordering constraints
Yes you are correct, without order by server is free to choose both the ordered
and unordered
scan, and yes you are right in this:
I also don't understand why using TOP without ORDER BY is a bad practice in case I want ANY N rows from the result
When you don't need top N ordered by, because you just want to see what kind of records have 'msc:' in them, you should not add order by because it could cause a SORT
in your plan.
And to your main question:
Then why does the first query take long time to execute?
The answer is: this was pure coincidence. Your data is laying in way that the rows with 'msc:' in them go first, in the order defined by Ordering. And if you scan your index not in order they are just in the middle or at the end of the table.
If you seek for another pattern in PersistenceID
the unordered scan
will be faster