
Understand the thinking behind "slow error is even worse than a fast error"


While reading about the SRE four golden signals (under the Latency section) at https://sre.google/sre-book/monitoring-distributed-systems/, I am specifically unable to understand the line below:

On the other hand, a slow error is even worse than a fast error!

What does it mean? Could you provide an easy-to-understand example, please?

[Research] While reading the book, I tried to understand the context, but I couldn't grasp/visualise it correctly. I did a thorough search on the internet (within the limits of my knowledge), but I'm sure I'm missing the right keywords. So finally I've taken the route of asking on Stack Overflow.


Solution

  • Here's the whole paragraph for context:

    Latency

    The time it takes to service a request. It’s important to distinguish between the latency of successful requests and the latency of failed requests. For example, an HTTP 500 error triggered due to loss of connection to a database or other critical backend might be served very quickly; however, as an HTTP 500 error indicates a failed request, factoring 500s into your overall latency might result in misleading calculations. On the other hand, a slow error is even worse than a fast error! Therefore, it’s important to track error latency, as opposed to just filtering out errors.

    I would interpret the sentence you pointed out as follows:

    • A failing request caused by an application error will usually take very little time to finish.
    • When you measure latency across all requests, regardless of status, these fast failures skew your statistics downwards. They drag down the min, average and median, giving you a false impression of low latency (see the sketch after this list).
    • While most errors will likely be thrown early, so their latency is low, there can be cases where a request takes a long time and still ends up with an error. Requests that both take a long time AND fail are worse than those that just fail quickly (and obviously worse than slow but successful requests). The client (a user or an upstream system) not only has to deal with the error, it first has to wait for the error to occur. It's like waiting in a queue at the dentist, only to be told after 30 minutes of waiting that the doctor is not available anyway. You would wish they had told you earlier.
    • Because of these two factors, we want to measure the latency of failing requests, but we want to measure it separately from that of succeeding requests.
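
    To make the skew concrete, here is a minimal Python sketch with made-up latency numbers: a handful of fast 500s pulls the min, median and average well below what successful requests actually experience.

    ```python
    import statistics

    # Hypothetical latencies in milliseconds: successful requests take
    # roughly 200 ms, while fast 500s return within a few milliseconds.
    success_ms = [180, 210, 195, 250, 205, 190, 220, 240]
    fast_error_ms = [4, 6, 5, 3]

    mixed = success_ms + fast_error_ms
    print("All requests (errors included):")
    print(f"  min={min(mixed)}  median={statistics.median(mixed)}  "
          f"avg={statistics.mean(mixed):.1f}")

    print("Successful requests only:")
    print(f"  min={min(success_ms)}  median={statistics.median(success_ms)}  "
          f"avg={statistics.mean(success_ms):.1f}")
    ```

    Mixing in just four fast errors drops the minimum from 180 ms to 3 ms and the average from about 211 ms to about 142 ms, even though the experience of successful requests has not changed at all.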

    In short, the author suggests the following two metrics for latency:

    • Latency of successful requests
    • Latency of failing requests (at least the 5xx kind of errors)
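
    As a sketch of how the two metrics might be recorded in practice, here is one way to do it with the Python prometheus_client library. The metric name, the "outcome" label, and the timed helper are illustrative choices of mine, not something the book prescribes.

    ```python
    import time
    from prometheus_client import Histogram

    # One histogram with an "outcome" label: error latency is tracked
    # rather than filtered out, but stays separate from success latency.
    REQUEST_LATENCY = Histogram(
        "http_request_duration_seconds",
        "Request latency, partitioned by outcome",
        ["outcome"],
    )

    def timed(handler):
        """Call handler(), which returns an HTTP status code, and record
        the elapsed time under the matching outcome label."""
        start = time.monotonic()
        status = handler()
        outcome = "error" if status >= 500 else "success"
        REQUEST_LATENCY.labels(outcome=outcome).observe(time.monotonic() - start)
        return status

    # Hypothetical handlers: a fast error and a slower success.
    timed(lambda: 500)                        # fast 500
    timed(lambda: (time.sleep(0.2), 200)[1])  # slow 200
    ```

    With the label in place, you can alert on the two latency distributions independently, e.g. a slow-error alert that fires even while overall latency looks healthy.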

    Hope this helps.