I am a developer who is also responsible for database administration at my company. We have several Microsoft SQL Servers, including one Azure SQL Managed Instance. Recently, and at random times, all queries will fail with execution timeout errors and will continue to fail until I log into the Azure Portal, "Stop" the server, and then "Start" it again. I noticed from the Azure Portal dashboard that when this happens, the average CPU usage drops to nearly 0% (it is almost always 50%-60% normally). I have now set alerts to notify me when CPU usage drops below 10%. This may happen once a week or even less frequently; sometimes several weeks go by between occurrences. The first time I remember this happening was maybe 2 months ago. I have not noticed a discernible pattern in the occurrences.
Recently, we had an issue where SELECTs and other low-overhead queries would still succeed, but high-overhead queries such as trying to INSERT PDF files (in base64 format) and DROP INDEX statements would fail with the same execution timeout error. I spent nearly a day digging through my code and testing the same INSERT statement on multiple servers, including my own computer. For my testing, I ended up canceling the query when it did not finish after 11 minutes (it normally takes less than 30 seconds). I checked for long-running or hung transactions; the oldest still-running query was around 10 minutes old (the issue had been going on for hours at that point). Running out of ideas, I decided to try "Stopping" and "Starting" the server, and sure enough that fixed the issue.
Yes, I do have a workaround for this, but it would be very inconvenient if it happened in the middle of the night or on a weekend, etc., when I am away from my computer. I am hesitant to contact Microsoft Azure support because I think they would have trouble diagnosing the issue if it is not actively happening at the moment. Also, in the one experience I had with Microsoft Azure support, they were less than helpful: I spent 5-6 hours on-and-off the phone with them, all the while our server was completely unreachable, and ultimately I stumbled across a reset button in the Azure Portal and ended up fixing the issue myself. But I don't have any clue how to further diagnose and ultimately resolve this issue. Has anyone run into this before?
An average 50-60% CPU utilization indicates there is some serious use going on here, with the potential for a lot of inefficiencies that might be the root cause of this failure. It is not likely to be an issue in the infrastructure itself, but rather in your application logic and how it uses the database.
trying to INSERT PDF files (in base64 format)
So while you can manage files in a SQL database, it is expensive, less efficient and less secure than using other storage solutions. From a performance point of view, many developers follow really poor code examples of how to upload and access files in SQL databases. There are acceptable solutions out there, but they require discipline and knowledge, especially if you are using an ORM like Entity Framework or NHibernate to access the data. You have to carefully design the tables that contain file blobs and how that data is accessed, but on the storage side there are also specific techniques like partitioned indexes, partitioned filegroups and file streaming that should be considered.
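If you do decide to move the files out of the database, one low-friction shape for the data is a table that stores only a pointer to the external store rather than the base64-encoded content. This is only a rough sketch and every name in it is hypothetical:

```sql
-- Sketch: keep the file in external blob storage and store only a reference to it,
-- instead of the base64-encoded PDF content itself.
CREATE TABLE dbo.Document
(
    DocumentId      INT IDENTITY(1,1) PRIMARY KEY,
    FileName        NVARCHAR(260)  NOT NULL,
    ExternalBlobUri NVARCHAR(1000) NOT NULL,  -- URI of the file in the external store
    ContentType     NVARCHAR(100)  NOT NULL,
    SizeBytes       BIGINT         NOT NULL,
    UploadedAtUtc   DATETIME2(0)   NOT NULL DEFAULT SYSUTCDATETIME()
);
```

The row stays small, the expensive bytes never pass through the database engine, and the read path for metadata stays cheap.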
DROP INDEX statements would fail
If your database activity is high, your normal application logic includes DROP INDEX statements, and those statements are timing out, this indicates that your table is in use: DROP INDEX needs a schema modification (Sch-M) lock, which it cannot acquire while other sessions hold locks on the table.
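You can confirm that during an incident by looking at the blocking chain from a separate session. A minimal sketch against the DMVs:

```sql
-- Sketch: show requests that are blocked or have been waiting a long time,
-- together with the session that is blocking them and the statement text.
SELECT
    r.session_id,
    r.blocking_session_id,
    r.wait_type,
    r.wait_time       AS wait_time_ms,
    r.command,
    t.text            AS sql_text
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.blocking_session_id <> 0
   OR r.wait_time > 30000;   -- waiting longer than 30 seconds
```

If the DROP INDEX shows up here waiting on another session, the timeout is a scheduling/blocking problem, not a statement problem.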
The first time I remember this happening was maybe 2 months ago. I have not noticed a discernible pattern in the occurrences.
A common reaction among application developers who start to see these issues is to increase logging within the application layer. If you are not using an external log storage provider and are instead writing log entries into your database, you might find that your logging methodology makes the problem worse, even if you are only logging exceptions: as the DB stops responding, your application floods the database with logs about the failure, which only further exacerbates the problem because now those log writes cannot complete either.
In general, this level of noise means there is reduced capacity to handle bursts of activity, and it is those bursts that result in failure, especially when your application logic allows accidental race conditions.
The basics for monitoring are documented here: Monitoring Microsoft Azure SQL Managed Instance performance using dynamic management views
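As a starting point, capturing a snapshot of the wait statistics on a schedule gives you a baseline to compare against when the next incident hits. A minimal sketch using one of the DMVs covered in that documentation:

```sql
-- Sketch: snapshot the top wait types so you have a baseline for "normal"
-- to compare against when the server next becomes unresponsive.
SELECT TOP (20)
    wait_type,
    waiting_tasks_count,
    wait_time_ms,
    signal_wait_time_ms,
    SYSUTCDATETIME() AS captured_at_utc
FROM sys.dm_os_wait_stats
WHERE wait_time_ms > 0
ORDER BY wait_time_ms DESC;
```

Persist the output somewhere outside the affected database (an Azure Monitor log, a file, or another server) so it is still readable while the instance is unresponsive.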
Deadlocks and slow or long-running queries at the time of failure are a symptom and not always the cause, especially if a general race condition has led to the failure. This might explain why you cannot identify a pattern during the failure; what is more important are the events leading up to it.
What are common causes for this in application design specifically against SQL Server?
Increase resources so you can weather the storm; doubling resources is usually enough. Get the hum of "normal" activity to below 20% if you can, but certainly below 50%.
Try to identify inefficiencies using some of the guidance above before you try to change them; don't play whack-a-mole. For each potential risk that you identify, grade it in terms of frequency, impact and complexity of the solution. Identify the associated metrics and make sure you understand how to monitor or record them.
Be pragmatic: implement what you can with minimal effort first, and do not try to fix something that you cannot measure. If you cannot prove that there is an improvement, or you have no way to measure it, do not take that on first; there will be no way to determine when that improvement task is done.
Prioritise solutions that move the heavy lifting away from read operations and onto write operations if your application is predominantly read-heavy (as most are).
We already talked about moving file storage out of the database: for all its strengths, there are better and cheaper purpose-built tools out there for providing file storage services. Using SQL Server for file storage requires specific infrastructure and code optimizations to make it viable; you will be far better off migrating the files out of the database.
A practical thing you can do in the database alone is to manage your indexes. Ideally your application and data access layer should be optimised to use specific indexes, and in many cases changing the query logic in the code is the better solution, but you can work in reverse and start at the database: Tune nonclustered indexes with missing index suggestions. Don't just blindly implement all suggestions; use the feedback, and you might identify specific areas in your application that could be enhanced to use indexes better. Also look here: Optimize index maintenance to improve query performance and reduce resource consumption.
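If you want to see the raw missing-index suggestions yourself before acting on them, something along these lines will list them with a rough impact estimate. Treat the output as hints to investigate, not commands to run, and expect it to reset on restart; this is only a sketch:

```sql
-- Sketch: list missing-index suggestions recorded by the engine,
-- ranked by a rough estimate of how much they would have helped.
SELECT TOP (25)
    d.statement AS table_name,
    d.equality_columns,
    d.inequality_columns,
    d.included_columns,
    s.user_seeks,
    s.avg_total_user_cost * s.avg_user_impact * s.user_seeks AS impact_score
FROM sys.dm_db_missing_index_details AS d
JOIN sys.dm_db_missing_index_groups AS g
    ON g.index_handle = d.index_handle
JOIN sys.dm_db_missing_index_group_stats AS s
    ON s.group_handle = g.index_group_handle
ORDER BY impact_score DESC;
```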
A quick fix in the application is often to ensure you are using an appropriate ISOLATION LEVEL on your connections. When DB activity is high, Read Committed often offers the best practical balance between performance and consistency. The main thing to worry about is the transactions themselves. If you are not using long transactions that need to re-query the same data and rely on getting the same records back, then Read Committed is fine for general use. If you are using long transactions that re-query, consider using temporary tables, table variables or other methods to cache the data rather than re-querying it, and use Read Uncommitted with care... If you are not using stored procedures or even explicit transactions, then in your application you may need to take specific steps to set the isolation level for the default transaction scope your application queries execute in.
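As a rough T-SQL illustration of both points (the table and column names here are made up), set the isolation level explicitly and cache a result set in a table variable instead of re-querying it inside the transaction:

```sql
-- Sketch: explicit isolation level plus caching a result set in a table variable
-- so the same data is not re-read from the base table inside the transaction.
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;

BEGIN TRANSACTION;

    DECLARE @OpenOrders TABLE (OrderId INT PRIMARY KEY, CustomerId INT, Total MONEY);

    -- Query the data once and cache it for the rest of the transaction.
    INSERT INTO @OpenOrders (OrderId, CustomerId, Total)
    SELECT OrderId, CustomerId, Total
    FROM dbo.Orders
    WHERE Status = 'Open';

    -- Later steps read from the cached copy instead of hitting dbo.Orders again.
    SELECT CustomerId, SUM(Total) AS OpenTotal
    FROM @OpenOrders
    GROUP BY CustomerId;

COMMIT TRANSACTION;
```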
If you have a high volume of reads from specific functional areas or types of interactions, you might consider Using read-only replicas to offload read-only query workloads. This is not always the best solution, but depending on your application and individual use case you might find that easier to implement at first.
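If you go that route, read traffic is usually directed with the ApplicationIntent=ReadOnly connection string option, and you can sanity-check where a session actually landed with something like the following sketch:

```sql
-- Sketch: check whether the current connection is on a read-only replica.
-- Returns READ_ONLY on a replica, READ_WRITE on the primary.
SELECT DATABASEPROPERTYEX(DB_NAME(), 'Updateability') AS updateability;
```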
At the application or service level, look for patterns of queries that repeatedly call the database and return the same data, especially static data that is not likely to change for either a known period of time or the duration of the runtime execution; these are good candidates for caching, or at least a memoize pattern.
If you have a middle tier such as web services or a hosted API that spans multiple server hosts (think web farm), then you should consider using a distributed cache like Redis instead of rudimentary local-process or file-based caching.
Traditional caching isn't always the silver bullet; for slow-moving datasets, implementing materialized views can make a lot of sense. This is not something offered OOTB in SQL Server, but you can still implement the concept in your application logic; think of it as manual indexing.
Materialized views are especially useful in situations where complex queries or aggregations are performed frequently, and the underlying data changes infrequently. By storing the precomputed results, the database can avoid the need to execute complex queries repeatedly, resulting in faster response times.
To implement it, use either a timer or a data-change event to trigger the normally expensive query across your data and store the results in another table. The read operations that would normally execute that query can then be converted into simple statements that select from your improvised materialized view. This is just another form of cache, except we are caching within the database itself.
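A minimal sketch of that pattern, with made-up table names, is a summary table refreshed by a stored procedure that you schedule (for example via an Agent job) or call from your data-change logic:

```sql
-- Sketch: an improvised "materialized view" refreshed on a schedule.
CREATE TABLE dbo.SalesByCustomer_MV
(
    CustomerId   INT          NOT NULL PRIMARY KEY,
    OrderCount   INT          NOT NULL,
    TotalAmount  MONEY        NOT NULL,
    RefreshedUtc DATETIME2(0) NOT NULL
);
GO

CREATE OR ALTER PROCEDURE dbo.RefreshSalesByCustomer_MV
AS
BEGIN
    SET NOCOUNT ON;

    BEGIN TRANSACTION;

        -- Replace the cached results with a fresh aggregation.
        TRUNCATE TABLE dbo.SalesByCustomer_MV;

        INSERT INTO dbo.SalesByCustomer_MV (CustomerId, OrderCount, TotalAmount, RefreshedUtc)
        SELECT CustomerId, COUNT(*), SUM(Total), SYSUTCDATETIME()
        FROM dbo.Orders
        GROUP BY CustomerId;

    COMMIT TRANSACTION;
END;
GO

-- Readers then query the precomputed table instead of aggregating dbo.Orders:
-- SELECT CustomerId, TotalAmount FROM dbo.SalesByCustomer_MV;
```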
Many application designs can benefit from this concept. In the 90s it was the first approach we reached for when designing data-driven applications; we had to, because of operational costs and physical hardware constraints. It is only with the advancement of cheaper and faster memory and processors that SQL Server has evolved into a beast that allows us to be lazy in our application design and execution and still get away with it.
Materialized views, once implemented, can then also be transitioned to external databases or other data providers, like a Redis database or Azure AI Search, or you could use those services instead of manually managing MVs in the database directly.
This is not an exhaustive list, just the things that immediately sprang to mind on seeing the OP. There is a lot to unpack, but I hope that some of the links here will direct you to resources and ideas that help you solve your conundrum. It might be that reaching out for direct mentorship or commercial consultancy would get this resolved quicker, but where is the fun in that ;)