sql-server primary-key clustered-index sql-optimization non-clustered-index

Nonclustered primary key dilemma


Suppose we had to define optimal indexing for Stack Overflow questions. Rather than taking the schema of the actual Posts table, let's include only the columns that are relevant here:

create table Posts (
    Id int not null
        identity,
    PostTypeId tinyint not null,
    LastActivityDate datetime not null
        default getdate(),
    Title nvarchar(500) null, -- answers don't have titles
    Body nvarchar(max) not null,
    ...
)

I've made Id an identity column even though Data Stackexchange shows that none of the tables have a primary key constraint or identity columns. They just have a number of unique/non-unique clustered/non-clustered indices.

Usage scenarios

So basically there are three main scenarios for posts:

  1. They're displayed chronologically in descending order by their LastActivityDate column (or maybe LastEditDate, which I haven't included above as it's not so important)
  2. They're displayed individually on the question details page
  3. Answers are displayed on the question details page ordered by votes (the ScoreCount column is not part of the code above)

Indexing optimization

Which indices would be best for the scenarios above, especially if we say that #1 is the most common scenario, so it has to work really fast?

I'd say that one of the better possibilities would be to create these indices:

-- index 1
alter table Posts
add primary key nonclustered (Id);

-- index 2
create clustered index IX_Posts_LastActivityDate
on Posts(LastActivityDate desc);

-- index 3
create index IX_Posts_ParentId
on Posts(ParentId, PostTypeId)
include (ScoreCount);

This way we basically end up with three indices of which the second one is clustered.

So in order for #1 to work really fast, I've put the clustered index on the LastActivityDate column, because clustered indices are especially good for range comparisons. We would be ordering questions chronologically from newest to oldest, hence the descending sort direction. Note that the index as defined above contains only LastActivityDate; including the post type as well would mean adding PostTypeId as a second key column.

So what did we solve with this?

  1. scenario #1 is very efficiently served by index 2, as it's clustered and therefore covers every column; we can also do result paging easily and efficiently;
  2. scenario #2 is somewhat covered by unique index 1 (to get the question) and non-unique index 3 (to get all related answers, scenario #3, ordered by ScoreCount); and if we decide to order answers chronologically, that's also covered by index 2;
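To make the three scenarios concrete, the queries they imply might look roughly like the sketch below. It assumes the elided columns ParentId and ScoreCount exist as used by index 3, and that PostTypeId = 1 marks questions and 2 marks answers (an assumption; the schema above doesn't define the values). The variables @PageNumber, @PageSize and @QuestionId are hypothetical, and offset/fetch needs SQL Server 2012 or later.

```sql
declare @PageNumber int = 1,   -- hypothetical paging parameters
        @PageSize   int = 50,
        @QuestionId int = 1;   -- hypothetical question id

-- scenario #1: questions with the newest activity first, one page at a time
select Id, Title, LastActivityDate
from Posts
where PostTypeId = 1           -- assumed: 1 = question
order by LastActivityDate desc
offset (@PageNumber - 1) * @PageSize rows
fetch next @PageSize rows only;

-- scenario #2: a single question by Id
select Id, Title, Body, LastActivityDate
from Posts
where Id = @QuestionId;

-- scenario #3: answers to that question, highest score first
-- (columns outside index 3, such as Id and Body, need a lookup
--  into the clustered index if index 3 is used)
select Id, Body, ScoreCount
from Posts
where ParentId = @QuestionId
  and PostTypeId = 2           -- assumed: 2 = answer
order by ScoreCount desc;
```

Whether each query is actually served by the intended index is best confirmed from the execution plan rather than assumed.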

Question 1

SQL Server internals are such that it implicitly adds the clustered index key to every nonclustered index so it can locate records in the row store.

  • if the clustered index is unique, then that's the key that will be added to nonclustered indices, and
  • if the clustered index is non-unique, SQL Server supposedly generates its own unique value and uses that

Since I've also added a nonclustered primary key on the table (which must by design be unique), I would like to know whether SQL Server will still supply its own unique key on the non-unique clustered index, or whether it will use the nonclustered primary key to uniquely identify each record instead?

Question 2

So if the primary key isn't used to locate records in the row store (clustered index), does it even make sense to create a PK? Would it be better in this case to do this instead?

create unique index UX_Posts_Id
on Posts(Id);
-- include (Title, Body, ScoreCount);

It would be great to also include the commented-out columns, but that would make this index less efficient because it would cache worse. The reason I'm asking whether this index would be better than a primary key constraint is that we can add non-key included columns to an index, while we can't do the same with a PK constraint, which internally generates a unique index.

Question 3

I'm aware that LastActivityDate changes, which isn't desirable for a clustered index key. But this column mostly changes for a while after a post is created and then becomes more or less static, so it shouldn't cause too much index fragmentation: whenever LastActivityDate changes, the record is effectively moved to the end of the index. Inserts into some arbitrary older page should never happen, because LastActivityDate only increases. Hence most modifications will happen on the last page.

So the question is whether these changes can be harmful as LastActivityDate isn't the best candidate for clustering index key:

  • it's not unique - although one could argue about this, especially if we'd change datetime to datetime2 and use higher precision function sysdatetime() and set index as unique
  • it's narrow - pretty much
  • it's not static - but I've explained how it changes
  • it's ever increasing

Solution

  • Since I've also added a nonclustered primary key on the table (which must by design be unique), I would like to know whether SQL Server will still supply its own unique key on the non-unique clustered index, or whether it will use the nonclustered primary key to uniquely identify each record instead?

    SQL Server adds a 4-byte "uniqueifier" to a row when its value for a non-unique clustered index key duplicates that of another row. All non-clustered index leaf nodes, including those of the primary key index, will include LastActivityDate plus the uniqueifier (when present) as the row locator; the nonclustered primary key is not used for this. The internal uniqueifier would be needed here only for posts with the same LastActivityDate, so I'd expect relatively few rows to actually need one.
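As a quick check of which indexes exist and whether the clustered one is unique, a catalog query along these lines can help. This is a minimal sketch that assumes the table lives in the dbo schema; a clustered index reported with is_unique = 0 is the case where a uniqueifier can come into play.

```sql
-- List the indexes on Posts with their type and uniqueness.
select i.name,
       i.type_desc,        -- CLUSTERED or NONCLUSTERED
       i.is_unique,        -- 0 for the non-unique clustered index
       i.is_primary_key
from sys.indexes as i
where i.object_id = object_id(N'dbo.Posts');   -- assumed schema: dbo
```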

  • So if the primary key isn't used to locate records in the row store (clustered index), does it even make sense to create a PK? Would it be better in this case to do this instead?

    From a data modeling perspective, every relational table should have a primary key. The implicitly created index can be declared as either clustered or non-clustered as needed to optimize performance. If LastActivityDate is a better clustering choice for performance, then the primary key index must be non-clustered. This primary key index will provide the index needed to retrieve singleton posts.

    Unfortunately, SQL Server doesn't provide a way to specify included columns on primary key and unique constraint definitions. This is a case where one can bend the rules and use a unique index instead of a declared primary key constraint, in order to avoid the cost of redundant indexes and gain the benefits of included columns. A unique index on a non-nullable column is functionally equivalent to a primary key and can be referenced by foreign key constraints.
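A minimal sketch of that arrangement follows. The tables PostsDemo and CommentsDemo, the int type of ScoreCount, and the choice of included columns are all hypothetical and exist only to keep the example self-contained.

```sql
-- Hypothetical, trimmed-down tables for illustration only.
create table dbo.PostsDemo (
    Id int not null identity,
    LastActivityDate datetime not null default getdate(),
    Title nvarchar(500) null,
    ScoreCount int not null default 0      -- assumed type
);

create clustered index IX_PostsDemo_LastActivityDate
on dbo.PostsDemo (LastActivityDate desc);

-- Unique index in place of a primary key constraint, with included columns.
create unique nonclustered index UX_PostsDemo_Id
on dbo.PostsDemo (Id)
include (Title, ScoreCount);

-- A foreign key can reference the column backed by the unique index.
create table dbo.CommentsDemo (
    Id int not null identity,
    PostId int not null
        constraint FK_CommentsDemo_PostsDemo
        references dbo.PostsDemo (Id)
);
```

The wider the included columns, the fewer index rows fit on a page, so which columns to include is a trade-off against cache efficiency.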

  • So the question is whether these changes can be harmful as LastActivityDate isn't the best candidate for clustering index key

    LastActivityDate alone can never be guaranteed to be unique regardless of the level of precision (barring single-threaded inserts or retry logic). One approach could be a composite clustered primary key on LastActivityDate and Id. Individual posts would then need to be retrieved using both values. That would eliminate the need for the separate unique index on Id discussed previously.
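The composite key idea could be sketched as below. The table name PostsByActivityDemo and the variable values are hypothetical; the point is only that the clustered key becomes unique and that a singleton lookup has to supply both columns to seek on it.

```sql
-- Hypothetical, trimmed-down table for illustration only.
create table dbo.PostsByActivityDemo (
    Id int not null identity,
    LastActivityDate datetime not null default getdate(),
    Title nvarchar(500) null,
    constraint PK_PostsByActivityDemo
        primary key clustered (LastActivityDate desc, Id)
);

-- Singleton retrieval must provide both key values.
declare @LastActivityDate datetime = '2012-01-01T00:00:00',  -- hypothetical
        @Id int = 1;                                         -- hypothetical

select Id, Title
from dbo.PostsByActivityDemo
where LastActivityDate = @LastActivityDate
  and Id = @Id;
```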

    My biggest concern about LastActivityDate as the leftmost clustered index key column is that it may change often for recent posts. This would require a lot of row movement to maintain the logical key order, may impact concurrency significantly compared to the current static Id key, and require updates to the non-clustered index row locator values upon each change. So even though this clustered index key may be optimal for many queries, the other costs on a highly transactional system may outweigh the benefits.
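One way to see how much this matters for a given workload is to measure fragmentation after a representative period of updates. The query below is a sketch that assumes the table is dbo.Posts in the current database; it reports fragmentation per index, not concurrency effects.

```sql
-- Fragmentation and size of each index on Posts.
select i.name,
       s.index_type_desc,
       s.avg_fragmentation_in_percent,
       s.page_count
from sys.dm_db_index_physical_stats(
         db_id(), object_id(N'dbo.Posts'), null, null, 'LIMITED') as s
join sys.indexes as i
  on i.object_id = s.object_id
 and i.index_id  = s.index_id;
```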