Tags: performance, debugging, memory-management, google-cloud-dataflow, apache-beam

Unusual Memory Usage Pattern in GCP Dataflow Jobs Across Multiple Environments


I am currently running 4 Dataflow jobs, each replicated across 3 separate environments. All 12 instances have been running successfully for the past 3 months. However, I've observed an odd memory usage pattern in most of them: memory consumption climbs in a stair-step pattern and then the worker crashes abruptly. Fortunately, no data has been lost as a result of these crashes.

At the moment, we are operating under a low load, but I'm concerned about potential issues once we transition to full load. Should I be worried about this memory usage pattern? If so, what are the primary areas I should investigate to diagnose and address it? Any guidance or insights would be greatly appreciated.

[Three screenshots of the memory usage graphs were attached here.]


Solution

  • After investigating together with Google support and making some changes to the jobs, we concluded that the most probable cause was late data that was not being handled in the jobs. Adding a configuration that drops late data appears to have resolved the issue; a sketch of what that configuration can look like follows below.
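
The answer doesn't include the pipeline code, so the following is only a minimal sketch of what "dropping late data" typically looks like in an Apache Beam Python pipeline. The Pub/Sub source, topic name, the `events` collection, the 60-second fixed windows, and the transform labels are all assumptions for illustration; the relevant parts are `allowed_lateness=0` and the discarding accumulation mode, which cause elements arriving after the watermark has passed the end of their window to be dropped rather than buffered in worker state.

```python
import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.transforms.trigger import AccumulationMode, AfterWatermark

with beam.Pipeline() as p:
    # Hypothetical streaming source; replace with the job's actual input.
    events = p | "ReadEvents" >> beam.io.ReadFromPubSub(
        topic="projects/my-project/topics/events"
    )

    windowed = (
        events
        | "WindowAndDropLate" >> beam.WindowInto(
            window.FixedWindows(60),          # 60-second fixed windows (assumed)
            trigger=AfterWatermark(),         # fire once when the watermark passes
            allowed_lateness=0,               # zero lateness: late elements are dropped
            accumulation_mode=AccumulationMode.DISCARDING,  # don't retain fired panes
        )
    )
```

The trade-off is that genuinely late elements are discarded rather than processed, but they are no longer held in state waiting for a pane that may never fire, which is consistent with the stair-step memory growth described in the question.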