I've got an infrastructure that effectively ingests and processes objects and images.
I've currently got around 20 Lambdas in my infrastructure -- determining which Objects need to get scheduled, queueing Object ingestion, doing the ingestion, processing the results based on data in the given Object, and saving various data and metadata. Basically every task relates to a given Object as it moves through the processing pipeline.
The system as a whole is stable and cranks through millions of Lambda invocations per day, working with about a thousand Objects (and growing)... each Object is processed 1-1,500 times per day depending on that Object's specific configuration. These Objects are real-world elements that often fail, have delays, and hit other issues. Occasional errors for a given Object are expected -- we generate thousands of 'errors' per day.
CloudWatch Logs serves me well for determining issues with a specific Lambda -- did this Lambda crash, why, etc.
However, I'd like to be able to better troubleshoot the entire pipeline. I struggle to answer the question "Why is Object X failing" because it's really tough to track a given Object conditionally processing across dozens of lambdas.
In my head, I'd log by saving to DynamoDB with PK: ObjectID, SK: ISODate, and the log message as an attribute. That would let me build a quick endpoint that says "Give me all the logs about Object X", so I'd see entries like "Object X was enqueued", "Object X was ingested", "Object X had an error when processing XYZ". That said, I feel like this might consume a lot of WCUs and could end up being somewhat expensive? I'd need to run a pricing calculation, of course.
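To make the idea concrete, here's a minimal sketch of what each Lambda would write. All names (the `ObjectLogs` table, the attribute names, the helper function) are hypothetical, and the actual `put_item` call is only shown in a comment since it requires AWS credentials and boto3:

```python
import json
from datetime import datetime, timezone

def make_log_item(object_id, message, source_lambda):
    """Build a DynamoDB item for a per-Object log table.
    PK = ObjectID, SK = ISO-8601 timestamp, so a single Query
    on the partition key returns the Object's history in time order."""
    return {
        "ObjectID": {"S": object_id},  # partition key
        "Timestamp": {"S": datetime.now(timezone.utc).isoformat()},  # sort key
        "Message": {"S": message},
        "SourceLambda": {"S": source_lambda},
    }

item = make_log_item("object-123", "Object was enqueued", "scheduler")
# In each Lambda (boto3 is available in the Lambda runtime):
# boto3.client("dynamodb").put_item(TableName="ObjectLogs", Item=item)
print(json.dumps(item, indent=2))
```

The "Give me all the logs about Object X" endpoint is then just a `Query` with `KeyConditionExpression="ObjectID = :oid"`, optionally bounded by a timestamp range on the sort key.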
Being able to put this on a custom dashboard would be great (95% of my users don't have access to the AWS console). If there were a good API for CloudWatch, I could probably make things work with a "Give me every log that includes string XYZ" from date A to date B. Honestly, that could be the easiest option...??
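There is such an API: CloudWatch Logs Insights (`StartQuery` / `GetQueryResults` in boto3). A sketch of the query you'd run -- the log group names and limits are placeholders, and the boto3 calls are commented out since they need AWS credentials:

```python
def build_insights_query(object_id):
    """Build a CloudWatch Logs Insights query that returns every
    log line mentioning a given Object ID, newest first."""
    return (
        "fields @timestamp, @message, @log "
        f"| filter @message like /{object_id}/ "
        "| sort @timestamp desc "
        "| limit 200"
    )

query = build_insights_query("object-123")
# With boto3, run across all the pipeline's log groups, then poll for results:
# logs = boto3.client("logs")
# resp = logs.start_query(logGroupNames=[...], startTime=start_epoch,
#                         endTime=end_epoch, queryString=query)
# results = logs.get_query_results(queryId=resp["queryId"])
print(query)
```

Note that Insights queries are priced per GB of log data scanned, so at millions of invocations per day you'd want tight time windows and a consistent, searchable Object ID in every log line.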
What are some good options?
Looks to me like your question is the definition of distributed tracing.
Distributed tracing, also known as distributed request tracing, is a method of monitoring and observing service requests in applications built on a microservices architecture. Distributed tracing is used by IT and DevOps teams to track requests or transactions through the application they are monitoring — gaining vital end-to-end observability into that journey. This lets them identify any issues, including bottlenecks and bugs, that could be having a negative impact on the application’s performance and affect user experience.
(Source)
In practice, distributed tracing works by assigning a trace ID to the Object (in your case) when it enters the system. As the Object progresses, each of your Lambdas knows about that trace and emits information under that trace ID. Once the Object is processed (or fails processing), you can use the trace ID to view the high-level picture of your system and dive deep into each component.
AWS offers such a system through X-Ray. I recommend you take a look at the X-Ray documentation and try it out.
Example view from X-Ray: