I am a developer who is also responsible for database administration at my company. We have several Microsoft SQL Servers, including one Azure SQL Managed Instance. Recently, and at random times, all queries will fail with execution timeout errors and will continue to fail until I log into the Azure Portal, "Stop" the server, and then "Start" it again. I noticed from the Azure Portal dashboard that when this happens, the average CPU usage drops to nearly 0% (it is almost always 50%-60% normally). I have now set alerts to notify me when CPU usage drops below 10%. This may happen once a week or even less frequently; sometimes several weeks go by between occurrences. The first time I remember this happening was maybe 2 months ago. I have not noticed a discernible pattern in the occurrences.
Recently, we had an issue where SELECTs and other low-overhead queries would still succeed, but high-overhead queries such as trying to INSERT PDF files (in base64 format) and DROP INDEX statements would fail with the same execution timeout error. I spent nearly a day digging through my code and testing the same INSERT statement on multiple servers, including my own computer. For my testing, I ended up canceling the query when it did not finish after 11 minutes (it normally takes less than 30 seconds). I checked for long-running or hung transactions; the oldest still-running query was around 10 minutes old (the issue had been going on for hours at that point). Running out of ideas, I decided to try "Stopping" and "Starting" the server, and sure enough that fixed the issue.
Yes, I do have a workaround for this, but it would be very inconvenient if it happened in the middle of the night or on a weekend, etc., when I am away from my computer. I am hesitant to contact Microsoft Azure support because I think they would have trouble diagnosing the issue if it is not actively happening at the moment. Also, in the one experience I had with Microsoft Azure support, they were less than helpful: I spent 5-6 hours on-and-off the phone with them, all the while our server was completely unreachable, and ultimately I stumbled across a reset button in the Azure Portal and ended up fixing the issue myself. But I don't have any clue how to further diagnose and ultimately resolve this issue. Has anyone run into this before?
An average 50-60% CPU utilization indicates there is some serious use going on here, with the potential for a lot of inefficiencies that might be the root cause of this failure. It is not likely to be an issue in the infrastructure itself, but rather in your application logic and how it uses the database.
trying to INSERT PDF files (in base64 format)
So while you can manage files in a SQL database, it is expensive, less efficient and less secure than using other storage solutions. From a performance point of view, many developers follow really poor code examples of how to upload and access files in SQL databases. There are acceptable solutions out there, but they require discipline and knowledge, especially if you are using an ORM like Entity Framework or NHibernate to access the data. You have to carefully design the tables that contain file blobs and how that data is accessed, but on the storage side there are also specific techniques like partitioned indexes, partitioned filegroups and file streaming that should be considered.
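If you do decide to move the files out of the database, one low-friction shape for the data is a table that stores only a pointer to the external store rather than the base64-encoded content. This is only a rough sketch and every name in it is hypothetical:

```sql
-- Sketch: keep the file in external blob storage and store only a reference to it,
-- instead of the base64-encoded PDF content itself.
CREATE TABLE dbo.Document
(
    DocumentId      INT IDENTITY(1,1) PRIMARY KEY,
    FileName        NVARCHAR(260)  NOT NULL,
    ExternalBlobUri NVARCHAR(1000) NOT NULL,  -- URI of the file in the external store
    ContentType     NVARCHAR(100)  NOT NULL,
    SizeBytes       BIGINT         NOT NULL,
    UploadedAtUtc   DATETIME2(0)   NOT NULL DEFAULT SYSUTCDATETIME()
);
```

The row stays small, the expensive bytes never pass through the database engine, and the read path for metadata stays cheap.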
DROP INDEX statements would fail
If your database activity is high, your normal application logic includes DROP INDEX statements, and those statements are timing out, this indicates that your table is in use: DROP INDEX needs a schema modification (Sch-M) lock, which it cannot acquire while other sessions hold locks on the table.
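You can confirm that during an incident by looking at the blocking chain from a separate session. A minimal sketch against the DMVs:

```sql
-- Sketch: show requests that are blocked or have been waiting a long time,
-- together with the session that is blocking them and the statement text.
SELECT
    r.session_id,
    r.blocking_session_id,
    r.wait_type,
    r.wait_time       AS wait_time_ms,
    r.command,
    t.text            AS sql_text
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.blocking_session_id <> 0
   OR r.wait_time > 30000;   -- waiting longer than 30 seconds
```

If the DROP INDEX shows up here waiting on another session, the timeout is a scheduling/blocking problem, not a statement problem.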
The first time I remember this happening was maybe 2 months ago. I have not noticed a discernible pattern in the occurrences.
A common reaction among application developers who start to see these issues is to increase logging within the application layer. If you are not using an external log storage provider and are instead writing log entries into your database, you might find that your logging methodology makes the problem worse, even if you are only logging exceptions: as the DB stops responding, your application floods the database with logs about the failure, which only further exacerbates the problem because now those log writes cannot complete either.
In general, this level of noise means there is reduced capacity to handle bursts of activity, and it is those bursts that result in failure, especially when your application logic allows accidental race conditions.
The basics for monitoring are documented here: Monitoring Microsoft Azure SQL Managed Instance performance using dynamic management views
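As a starting point, capturing a snapshot of the wait statistics on a schedule gives you a baseline to compare against when the next incident hits. A minimal sketch using one of the DMVs covered in that documentation:

```sql
-- Sketch: snapshot the top wait types so you have a baseline for "normal"
-- to compare against when the server next becomes unresponsive.
SELECT TOP (20)
    wait_type,
    waiting_tasks_count,
    wait_time_ms,
    signal_wait_time_ms,
    SYSUTCDATETIME() AS captured_at_utc
FROM sys.dm_os_wait_stats
WHERE wait_time_ms > 0
ORDER BY wait_time_ms DESC;
```

Persist the output somewhere outside the affected database (an Azure Monitor log, a file, or another server) so it is still readable while the instance is unresponsive.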
Deadlocks and slow or long-running queries at the time of failure are a symptom and not always the cause, especially if a general race condition has led to the failure. This might explain why you cannot identify a pattern during the failure; what is more important are the events leading up to it.
What are common causes for this in application design specifically against SQL Server?
Increase resources so you can weather the storm; doubling resources is usually enough. Get the hum of "normal" activity to below 20% if you can, but certainly below 50%.
Try to identify inefficiencies using some of the guidance above before you try to change them; don't play whack-a-mole. For each potential risk that you identify, grade it in terms of frequency, impact and complexity of the solution. Identify the associated metrics and make sure you understand how to monitor or record them.
Be pragmatic: implement what you can with minimal effort first, and do not try to fix something that you cannot measure. If you cannot prove that there is an improvement, or you have no way to measure it, do not take that on first; there will be no way to determine when that improvement task is done.
Prioritise solutions that move the heavy lifting away from read operations and onto write operations if your application is predominantly read-heavy (as most are).
We already talked about moving file storage out of the database: for all its strengths, there are better and cheaper purpose-built tools out there for providing file storage services. Using SQL Server for file storage requires specific infrastructure and code optimizations to make it viable; you will be far better off migrating the files out of the database.
A practical thing you can do in the database alone is to manage your indexes. Ideally your application and data access layer should be optimised to use specific indexes, and in many cases changing the query logic in the code is the better solution, but you can work in reverse and start at the database: Tune nonclustered indexes with missing index suggestions. Don't just blindly implement all suggestions; use the feedback, and you might identify specific areas in your application that could be enhanced to use indexes better. Also look here: Optimize index maintenance to improve query performance and reduce resource consumption.
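If you want to see the raw missing-index suggestions yourself before acting on them, something along these lines will list them with a rough impact estimate. Treat the output as hints to investigate, not commands to run, and expect it to reset on restart; this is only a sketch:

```sql
-- Sketch: list missing-index suggestions recorded by the engine,
-- ranked by a rough estimate of how much they would have helped.
SELECT TOP (25)
    d.statement AS table_name,
    d.equality_columns,
    d.inequality_columns,
    d.included_columns,
    s.user_seeks,
    s.avg_total_user_cost * s.avg_user_impact * s.user_seeks AS impact_score
FROM sys.dm_db_missing_index_details AS d
JOIN sys.dm_db_missing_index_groups AS g
    ON g.index_handle = d.index_handle
JOIN sys.dm_db_missing_index_group_stats AS s
    ON s.group_handle = g.index_group_handle
ORDER BY impact_score DESC;
```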
A quick fix in the application is often to ensure you are using an appropriate ISOLATION LEVEL on your connections. When DB activity is high, Read Committed often offers the best practical balance between performance and consistency. The main thing to worry about is the transactions themselves. If you are not using long transactions that need to re-query the same data and rely on getting the same records back, then Read Committed is fine for general use. If you are using long transactions that re-query, consider using temporary tables, table variables or other methods to cache the data rather than re-querying it, and use Read Uncommitted with care... If you are not using stored procedures or even explicit transactions, then in your application you may need to take specific steps to set the isolation level for the default transaction scope your application queries execute in.
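As a rough T-SQL illustration of both points (the table and column names here are made up), set the isolation level explicitly and cache a result set in a table variable instead of re-querying it inside the transaction:

```sql
-- Sketch: explicit isolation level plus caching a result set in a table variable
-- so the same data is not re-read from the base table inside the transaction.
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;

BEGIN TRANSACTION;

    DECLARE @OpenOrders TABLE (OrderId INT PRIMARY KEY, CustomerId INT, Total MONEY);

    -- Query the data once and cache it for the rest of the transaction.
    INSERT INTO @OpenOrders (OrderId, CustomerId, Total)
    SELECT OrderId, CustomerId, Total
    FROM dbo.Orders
    WHERE Status = 'Open';

    -- Later steps read from the cached copy instead of hitting dbo.Orders again.
    SELECT CustomerId, SUM(Total) AS OpenTotal
    FROM @OpenOrders
    GROUP BY CustomerId;

COMMIT TRANSACTION;
```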
If you have a high volume of reads from specific functional areas or types of interactions, you might consider Using read-only replicas to offload read-only query workloads. This is not always the best solution, but depending on your application and individual use case you might find that easier to implement at first.
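If you go that route, read traffic is usually directed with the ApplicationIntent=ReadOnly connection string option, and you can sanity-check where a session actually landed with something like the following sketch:

```sql
-- Sketch: check whether the current connection is on a read-only replica.
-- Returns READ_ONLY on a replica, READ_WRITE on the primary.
SELECT DATABASEPROPERTYEX(DB_NAME(), 'Updateability') AS updateability;
```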
At the application or service level, look for patterns of queries that repeatedly call the database and return the same data, especially static data that is not likely to change for either a known period of time or the duration of the runtime execution; these are good candidates for caching, or at least a memoize pattern.
If you have a middle tier such as web services or a hosted API that spans multiple server hosts (think web farm), then you should consider using a distributed cache like Redis instead of rudimentary local-process or file-based caching.
Traditional caching isn't always the silver bullet; for slow-moving datasets, implementing materialized views can make a lot of sense. This is not something offered OOTB in SQL Server, but you can still implement the concept in your application logic; think of it as manual indexing.
Materialized views are especially useful in situations where complex queries or aggregations are performed frequently, and the underlying data changes infrequently. By storing the precomputed results, the database can avoid the need to execute complex queries repeatedly, resulting in faster response times.
To implement it, use either a timer or a data-change event to trigger the normally expensive query across your data and store the results in another table. The read operations that would normally execute that query can then be converted into simple statements that select from your improvised materialized view. This is just another form of cache, except we are caching within the database itself.
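A minimal sketch of that pattern, with made-up table names, is a summary table refreshed by a stored procedure that you schedule (for example via an Agent job) or call from your data-change logic:

```sql
-- Sketch: an improvised "materialized view" refreshed on a schedule.
CREATE TABLE dbo.SalesByCustomer_MV
(
    CustomerId   INT          NOT NULL PRIMARY KEY,
    OrderCount   INT          NOT NULL,
    TotalAmount  MONEY        NOT NULL,
    RefreshedUtc DATETIME2(0) NOT NULL
);
GO

CREATE OR ALTER PROCEDURE dbo.RefreshSalesByCustomer_MV
AS
BEGIN
    SET NOCOUNT ON;

    BEGIN TRANSACTION;

        -- Replace the cached results with a fresh aggregation.
        TRUNCATE TABLE dbo.SalesByCustomer_MV;

        INSERT INTO dbo.SalesByCustomer_MV (CustomerId, OrderCount, TotalAmount, RefreshedUtc)
        SELECT CustomerId, COUNT(*), SUM(Total), SYSUTCDATETIME()
        FROM dbo.Orders
        GROUP BY CustomerId;

    COMMIT TRANSACTION;
END;
GO

-- Readers then query the precomputed table instead of aggregating dbo.Orders:
-- SELECT CustomerId, TotalAmount FROM dbo.SalesByCustomer_MV;
```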
Many application designs can benefit from this concept. In the 90s it was the first approach we reached for when designing data-driven applications; we had to, because of operational costs and physical hardware constraints. It is only with the advancement of cheaper and faster memory and processors that SQL Server has evolved into a beast that allows us to be lazy in our application design and execution and still get away with it.
Materialized views, once implemented, can then also be transitioned to external databases or other data providers, like a Redis database or Azure AI Search, or you could use those services instead of manually managing MVs in the database directly.
This is not an exhaustive list, just the things that immediately sprang to mind on seeing the OP. There is a lot to unpack, but I hope that some of the links here will direct you to resources and ideas that help you solve your conundrum. It might be that reaching out for direct mentorship or commercial consultancy would get this resolved quicker, but where is the fun in that ;)