Tags: performance, debugging, memory-management, google-cloud-dataflow, apache-beam

Unusual Memory Usage Pattern in GCP Dataflow Jobs Across Multiple Environments


I am currently running 4 Dataflow jobs, each replicated across 3 separate environments. All 12 instances have been running successfully for the past 3 months. However, I've observed an odd memory usage pattern in most of them: memory consumption climbs in a stair-step pattern and then the worker crashes abruptly. Fortunately, no data has been lost as a result of these crashes.

At the moment, we are operating under a low load, but I'm concerned about potential issues once we transition to full load. Should I be worried about this memory usage pattern? If so, what are the primary areas I should investigate to diagnose and address it? Any guidance or insights would be greatly appreciated.

[Three screenshots of the memory usage graphs were attached here.]


Solution

  • After investigating together with Google support and making some changes to the jobs, we concluded that the most probable cause was late data that was not being handled in the jobs. Adding a configuration that drops late data appears to have resolved the issue; a sketch of what that configuration can look like follows below.
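
The answer doesn't include the pipeline code, so the following is only a minimal sketch of what "dropping late data" typically looks like in an Apache Beam Python pipeline. The Pub/Sub source, topic name, the `events` collection, the 60-second fixed windows, and the transform labels are all assumptions for illustration; the relevant parts are `allowed_lateness=0` and the discarding accumulation mode, which cause elements arriving after the watermark has passed the end of their window to be dropped rather than buffered in worker state.

```python
import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.transforms.trigger import AccumulationMode, AfterWatermark

with beam.Pipeline() as p:
    # Hypothetical streaming source; replace with the job's actual input.
    events = p | "ReadEvents" >> beam.io.ReadFromPubSub(
        topic="projects/my-project/topics/events"
    )

    windowed = (
        events
        | "WindowAndDropLate" >> beam.WindowInto(
            window.FixedWindows(60),          # 60-second fixed windows (assumed)
            trigger=AfterWatermark(),         # fire once when the watermark passes
            allowed_lateness=0,               # zero lateness: late elements are dropped
            accumulation_mode=AccumulationMode.DISCARDING,  # don't retain fired panes
        )
    )
```

The trade-off is that genuinely late elements are discarded rather than processed, but they are no longer held in state waiting for a pane that may never fire, which is consistent with the stair-step memory growth described in the question.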