Why did SQL Server suddenly decide to use such a terrible execution plan?

Background

We recently had an issue with the query plans SQL Server was using on one of our larger tables (around 175,000,000 rows). The column and index structure of the table has not changed for 5+ years.

The table and indexes look like this:

create table responses (
    response_uuid uniqueidentifier not null,
    session_uuid uniqueidentifier not null,
    create_datetime datetime not null,
    create_user_uuid uniqueidentifier not null,
    update_datetime datetime not null,
    update_user_uuid uniqueidentifier not null,
    question_id int not null,
    response_data varchar(4096) null,
    question_type_id varchar(3) not null,
    question_length tinyint null,
    constraint pk_responses primary key clustered (response_uuid),
    constraint idx_responses__session_uuid__question_id unique nonclustered (session_uuid asc, question_id asc) with (fillfactor=80),
    constraint fk_responses_sessions__session_uuid foreign key(session_uuid) references dbo.sessions (session_uuid),
    constraint fk_responses_users__create_user_uuid foreign key(create_user_uuid) references dbo.users (user_uuid),
    constraint fk_responses_users__update_user_uuid foreign key(update_user_uuid) references dbo.users (user_uuid)
)

create nonclustered index idx_responses__session_uuid_fk on responses(session_uuid) with (fillfactor=80)

The query that was performing poorly (~2.5 minutes instead of the normal <1 second performance) looks like this:

SELECT 
[Extent1].[response_uuid] AS [response_uuid], 
[Extent1].[session_uuid] AS [session_uuid], 
[Extent1].[create_datetime] AS [create_datetime], 
[Extent1].[create_user_uuid] AS [create_user_uuid], 
[Extent1].[update_datetime] AS [update_datetime], 
[Extent1].[update_user_uuid] AS [update_user_uuid], 
[Extent1].[question_id] AS [question_id], 
[Extent1].[response_data] AS [response_data], 
[Extent1].[question_type_id] AS [question_type_id], 
[Extent1].[question_length] AS [question_length]
FROM [dbo].[responses] AS [Extent1]
WHERE [Extent1].[session_uuid] = @f6_p__linq__0;

(The query is generated by Entity Framework and executed using sp_executesql.)

The execution plan during the poor performance period looked like this:

[Image: execution plan]

Some background on the data: running the query above would never return more than 400 rows. In other words, filtering on session_uuid really pares down the result set.

Some background on scheduled maintenance: a scheduled job runs weekly to rebuild the database's statistics and rebuild the table's indexes. The job runs a script that looks like this:

alter index all on responses rebuild with (fillfactor=80)

The resolution for the performance problem was to run the rebuild index script (above) on this table.

Other possibly relevant tidbits of information: the data distribution didn't change at all since the last index rebuild. There are no joins in the query. We're a SaaS shop with 50 to 100 live production databases, all with exactly the same schema, some with more data, some with less, and all with the same queries executing against them, spread across a few SQL Server instances.

Question:

What could have happened that would make SQL Server start using this terrible execution plan in this particular database?

Keep in mind the problem was solved by simply rebuilding the indexes on the table.

Maybe a better question is "under what circumstances would SQL Server stop using an index?"

Another way of looking at it is "why would the optimizer stop using an index that was rebuilt a few days ago, and then start using it again after we did an emergency rebuild of that index once we noticed the bad query plan?"

Solution

The reason is simple: the optimizer changes its mind on what the best plan is. This can be due to subtle changes in the distribution of the data (or other reasons, such as a type incompatibility in a join key). I wish there were a tool that not only gave the execution plan for a query but also showed thresholds for how close you are to another execution plan. Or a tool that would let you stash an execution plan and give an alert if the same query starts using a different plan.
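Since a plan flip like this is often tied to what the optimizer believes about the data, one quick thing to look at is how fresh the statistics on the table are. The query below is a minimal sketch, assuming the table lives in the dbo schema and that the server is recent enough to expose sys.dm_db_stats_properties; it only reads metadata and changes nothing.

```sql
-- List every statistics object on dbo.responses with its last update time,
-- the row counts it was built from, and how many modifications have
-- accumulated on its leading column since that update.
SELECT
    s.name                              AS stats_name,
    STATS_DATE(s.object_id, s.stats_id) AS last_updated,
    sp.rows                             AS rows_at_update,
    sp.rows_sampled,
    sp.modification_counter
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE s.object_id = OBJECT_ID(N'dbo.responses')   -- assumed schema name
ORDER BY last_updated;
```

A large modification_counter relative to rows_at_update, or a last_updated value much older than expected, would be a hint that the optimizer is working from a stale picture of the table.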

I've asked myself this exact same question on more than one occasion. You have a system that's running nightly, for months on end. It processes lots of data using really complicated queries. Then, one day, you come in in the morning and the job that normally finishes by 11:00 p.m. is still running. Arrrggg!

The solution that we came up with was to use explicit join hints for the joins that had gone wrong (option (merge join, hash join)). We also started saving the execution plans for all our complex queries, so we could compare changes from one night to the next. In the end, this was of more academic than practical interest: by the time we saw that a plan had changed, we were already suffering from the bad execution plan.
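Here is a rough sketch of both ideas in T-SQL. The join query is purely illustrative (the question's query has no joins), the column selection from dbo.sessions is an assumption, and dbo.plan_snapshots is a hypothetical table name, not something from our actual system.

```sql
-- 1) Restrict the optimizer to merge or hash joins for one query.
DECLARE @session_uuid uniqueidentifier = NEWID();  -- placeholder value

SELECT r.response_uuid, r.question_id, r.response_data
FROM dbo.responses AS r
JOIN dbo.sessions  AS s
    ON s.session_uuid = r.session_uuid
WHERE s.session_uuid = @session_uuid
OPTION (MERGE JOIN, HASH JOIN);   -- nested loops joins are ruled out

-- 2) Save the currently cached plans so they can be compared later.
IF OBJECT_ID(N'dbo.plan_snapshots') IS NULL
    CREATE TABLE dbo.plan_snapshots (        -- hypothetical table
        captured_at     datetime      NOT NULL DEFAULT (GETDATE()),
        query_hash      binary(8)     NOT NULL,
        query_plan_hash binary(8)     NOT NULL,
        query_text      nvarchar(max) NULL,
        query_plan      xml           NULL
    );

INSERT INTO dbo.plan_snapshots (query_hash, query_plan_hash, query_text, query_plan)
SELECT qs.query_hash, qs.query_plan_hash, st.text, qp.query_plan
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle)   AS st
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp
WHERE st.text LIKE N'%responses%';   -- narrow to the queries of interest

-- 3) Report queries that have been seen with more than one plan shape.
SELECT query_hash,
       COUNT(DISTINCT query_plan_hash) AS distinct_plans,
       MIN(captured_at)                AS first_seen,
       MAX(captured_at)                AS last_seen
FROM dbo.plan_snapshots
GROUP BY query_hash
HAVING COUNT(DISTINCT query_plan_hash) > 1;
```

Running step 2 on a schedule and step 3 afterwards gives a crude "the plan changed" report. As noted above, it tells you after the fact rather than before, and a join hint will keep forcing the same join types even if the data later changes enough that another strategy would be better.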