python amazon-web-services aws-lambda aws-step-functions

What is the best practice for handling unexpected errors in an AWS Step Function using Python Lambdas?

Step Functions are AWS structures that control the flow of lambdas (or other events). All my lambdas use Python (but Lambdas can use most major languages). Throughout the process my step function sends status updates back to the client (the client triggered it via API). Let's say it progresses through these updates: Started -> In Progress -> Finishing -> Done. For handled errors it will send an 'Error' status back to the client. So the client could see a timeline like this: Started -> In Progress -> Errored. This is ideal - so the user knows the process has stopped.

But when there are unexpected/unhandled errors the client never really knows and the timeline might sit at 'In Progress' indefinitely - the user doesn't know what happened. So I started looking into the built-in Step Function error handling. I like this option because I can create a 'Catch' function for each lambda or event where I can communicate back to the client if there is an error. The downside to this was that it really made the step function template/design messy see the before/after screenshots below.

BEFORE---------------

AFTER---------------

The template code that generates these graphs doesn't look much better. So I considered an alternative which seems similarly messy. I could add a single try/except block within each lambda for the entire lambda - to catch any/all errors. For example:

def lambda_handler(event, context):
    try:
        #Execute function tasks
    except:
        #Communicate back to client that there was an error

Similar to the step function 'Catch' functions this would ensure that I catch and communicate any error. But this seems like a bad idea just because of what it is (adding blanket/blind try/except).

So right now I'm stuck between messy/repeated code and try/except-ing everything. Am I implementing step function 'Catch' incorrectly? Am I missing a better way to handle unknown Python errors? Is there another approach entirely?

Solution

As @stijndepestel pointed out, having a catch-all error check is a good idea.

What I do in my Python Lambda functions is this: I have a custom router class, which besides route managing, it handles all errors. If the error inherits from a base error class that I've created, then it's custom error that I threw, and those are assigned special info when I created them that automatically gets formatted when they are converted into strings. The router sends that back to the client if possible.

But if the error is some unknown/unexpected one, then the router prints it with as much detail as possible to CloudWatch Logs, and then returns a generic "500 Internal Server Error" message to the client.

I'd probably set it up in the future to notify me by email or something like that when such errors occur, so that I can take action quickly.