azure timer azure-functions azure-logic-apps

How to avoid losing in-flight requests when an Azure Logic App crashes

We set up an Event Grid with a logic app as a subscriber to it. One of our core requirements is not to lose requests once the Event Grid has received them. Not losing requests means the system must always come back to the caller with either a success or a failure.

So here is what we have:

1. Time-to-live in Event Grid: when the logic app is dead (does not pick up the requests from Event Grid), the requests fall out to the "time-to-live" queue and we notify the caller with "fail".
2. Logic app timeout: when some part of the logic app fails or loops, the timeout occurs and we notify the caller with "fail".
3. The logic app runs smoothly, then "success" (oversimplifying).

Now the question is: what if the logic app crashes entirely? If it crashes, its timeout (2 above) will no longer work, so would we never be able to get back to the caller?

Are there solutions at the infrastructure level that don't require building a complex mechanism?

For example, logic app disaster recovery: set up two instances in different regions?

Or should we do something like:

Should we create another timer that exists completely separately from the logic app, so that it won't go down together with the logic app?
Should we save the request statuses as the logic app progresses, then create a separate function app that watches those statuses and, when the logic app comes back up, picks up the requests based on their statuses and pushes them to the logic app again?

Thank you kindly

I looked at the Microsoft Logic Apps technical documentation.

Solution

Interesting problem you've got there. I'm just gonna throw some thoughts into the void to help shape some ideas :)

If you are worried about your Logic App (LA) crashing and not being able to notify your event source about the crash, there are really only two alternatives: redundancy and/or event durability.

Redundancy: By fanning out your solution and letting multiple workers handle the events sent by Event Grid, you would increase the odds of the message actually making it back to the event source. This solution, however, requires that the receiver of the "success / fail" message can handle duplicates. It is, in my opinion, a lazy solution that will work in the short run but does not really solve the problem.
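To illustrate what "can handle duplicates" means on the receiving side, here is a minimal sketch. It assumes every request carries a stable id that all workers report back; the `ResultReceiver` class and the request ids are made up for illustration and are not part of any Azure SDK.

```python
class ResultReceiver:
    """Hypothetical caller-side receiver that ignores duplicate outcome messages."""

    def __init__(self):
        # request_id -> first outcome received ("success" or "fail")
        self.outcomes = {}

    def handle(self, request_id: str, outcome: str) -> bool:
        """Record the first outcome per request; return False for a duplicate."""
        if request_id in self.outcomes:
            return False  # another worker already reported this request
        self.outcomes[request_id] = outcome
        return True


receiver = ResultReceiver()
results = [
    receiver.handle("req-1", "success"),  # reported by worker A
    receiver.handle("req-1", "success"),  # same request reported by worker B
    receiver.handle("req-2", "fail"),
]
print(results)
```

In a real system the `outcomes` dictionary would have to live in durable storage, otherwise the receiver has the same crash problem as the logic app.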

Event durability: What I think you should do is involve a Service Bus in one way or another, to get the benefit of the dead-letter queue / max delivery count. The first problem with this is: how do we get the event into the Service Bus topic/queue? Well, we could use a few function apps (more than one, in different regions, so there is no single point of failure) to ingest the data into the Service Bus, see image:

The function apps will, in the normal case, ingest multiple copies of the event into the Service Bus; that's where duplicate detection comes in and saves us. We can now trigger the Logic App normally and let it handle the event. Once the event is marked as succeeded, the message is removed from the Service Bus and a success message is sent to the event source.
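Service Bus does duplicate detection on the broker side, based on the message id within a configurable time window, so each function app would have to send the event with the same message id (for example the Event Grid event id). The snippet below is only an in-memory model of that behaviour to show the idea; `DedupQueue`, the 600-second window, and the sample event are assumptions for illustration, not the Service Bus API.

```python
import time


class DedupQueue:
    """In-memory model of a queue that drops repeated message ids within a window."""

    def __init__(self, window_seconds: float = 600.0):
        self.window_seconds = window_seconds
        self._seen = {}      # message_id -> time the id was first accepted
        self.messages = []   # accepted (message_id, body) pairs

    def send(self, message_id, body, now=None) -> bool:
        """Accept the message unless its id was already seen inside the window."""
        now = time.monotonic() if now is None else now
        # Forget ids that have aged out of the detection window.
        self._seen = {
            mid: t for mid, t in self._seen.items()
            if now - t < self.window_seconds
        }
        if message_id in self._seen:
            return False  # duplicate: silently dropped
        self._seen[message_id] = now
        self.messages.append((message_id, body))
        return True


queue = DedupQueue(window_seconds=600)
event = {"id": "evt-42", "data": {"order": 7}}
accepted = [
    queue.send(event["id"], event["data"], now=0.0),  # function app in region A
    queue.send(event["id"], event["data"], now=1.5),  # function app in region B
]
print(accepted, len(queue.messages))
```

The point is that only one copy reaches the Logic App even though two function apps ingested the same event.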

In case of Logic App failure: if the logic app crashes and the message cannot be completed, the message is either run again once the logic app comes back online or put on the dead-letter queue. If it ends up on the dead-letter queue, you could have another Logic App / Function App that triggers on it and sends the fail message to the event source (from another region, to limit the chance of that resource being down as well).
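The retry-then-dead-letter flow can be sketched roughly like this. It is a plain Python simulation, not Service Bus code: `process_queue`, `flaky_logic_app`, and the max delivery count of 3 are all made-up stand-ins for the queue, the Logic App, and the queue setting.

```python
from collections import deque

MAX_DELIVERY_COUNT = 3  # assumed value; on a real queue this is a setting


def process_queue(messages, handler, max_delivery_count=MAX_DELIVERY_COUNT):
    """Redeliver each message until the handler succeeds or deliveries run out."""
    pending = deque((message, 0) for message in messages)
    completed, dead_lettered = [], []
    while pending:
        message, delivery_count = pending.popleft()
        delivery_count += 1
        try:
            handler(message)
        except Exception:
            if delivery_count >= max_delivery_count:
                dead_lettered.append(message)  # give up: goes to the dead-letter queue
            else:
                pending.append((message, delivery_count))  # will be delivered again
        else:
            completed.append(message)  # handler finished: message is removed
    return completed, dead_lettered


def flaky_logic_app(message):
    """Stand-in for the Logic App; always crashes on one particular request."""
    if message == "req-bad":
        raise RuntimeError("simulated crash")


completed, dead_lettered = process_queue(["req-ok", "req-bad"], flaky_logic_app)
# The separate dead-letter handler is what turns a dead-lettered message into "fail".
notifications = [(m, "success") for m in completed] + [(m, "fail") for m in dead_lettered]
print(notifications)
```

Either way every request ends in exactly one of the two lists, which is what gives you the "always come back to the caller" guarantee.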

This solution might be a bit too much, but I'm just throwing thoughts out there to trigger your imagination.