Tags: google-cloud-datastore, google-cloud-dataflow, apache-beam

Dataflow writing to datastore poor performance?


I recently upgraded my Dataflow Apache Beam pipeline from version 2.27 to 2.41. The pipeline writes a large amount of data to Datastore. Before the upgrade it took about 8 minutes to finish; after the upgrade it takes more than 30 minutes.

Before the Update

[Screenshot of the Dataflow job before the update]

After the update

[Screenshot of the Dataflow job after the update]

The "Enforce ramp-up through throttling" step did not appear in the job before I updated the pipeline version.

Update: The Apache Beam release notes for version 2.32.0 state:

DatastoreIO: Write and delete operations now follow automatic gradual ramp-up, in line with best practices (Java/Python)

I think this change is what causes the slower writes.


Solution

  • I checked with the team, and generally speaking this is the expected behavior. The IO's default settings follow best practices for ramp-up; turning the ramp-up off is possible but discouraged.

    The DatastoreV1 docs provide further guidance:

    Write and delete operations will follow a gradual ramp-up by default in order to protect Cloud Datastore from potential overload. This rate limit follows a heuristic based on the expected number of workers. To optimize throughput in this initial stage, you can provide a hint to the relevant PTransform by calling withHintNumWorkers, e.g., DatastoreIO.v1().deleteKey().withHintNumWorkers(numWorkers). While not recommended, you can also turn this off via .withRampupThrottlingDisabled().
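    To see why the worker hint matters, here is a rough sketch of a ramp-up budget. It is not Beam's actual implementation. It assumes the limiter follows the documented Datastore "500/50/5" guideline (start at 500 operations per second, grow by 50% every 5 minutes) and splits that budget evenly across the hinted number of workers. The function names, the even split, and the example hint of 500 workers are assumptions made for illustration.

    ```python
    # Assumed model of a per-worker ramp-up budget (illustrative only).
    BASE_OPS_PER_SECOND = 500       # starting rate from the 500/50/5 guideline
    GROWTH_FACTOR = 1.5             # +50% ...
    RAMP_UP_INTERVAL_MINUTES = 5    # ... every 5 minutes


    def per_worker_budget(minutes_since_start, hint_num_workers):
        """Ops/second a single worker may issue at a given time (hypothetical)."""
        # No growth during the first interval, then exponential growth.
        growth_steps = max(
            0.0,
            (minutes_since_start - RAMP_UP_INTERVAL_MINUTES)
            / RAMP_UP_INTERVAL_MINUTES,
        )
        budget = BASE_OPS_PER_SECOND / hint_num_workers * GROWTH_FACTOR ** growth_steps
        # Each worker is always allowed at least one operation per second.
        return max(1, int(budget))


    def total_budget(minutes_since_start, hint_num_workers, actual_workers):
        """Combined ops/second across the workers that are really running."""
        return per_worker_budget(minutes_since_start, hint_num_workers) * actual_workers


    if __name__ == "__main__":
        # Ten real workers, but the limiter is told to expect 500 (assumed hint).
        print(total_budget(0, hint_num_workers=500, actual_workers=10))
        # Ten real workers with a matching hint of 10.
        print(total_budget(0, hint_num_workers=10, actual_workers=10))
    ```

    Under this model, a hint much larger than the real worker count leaves each worker with a small share of the starting budget, so the job as a whole starts well below the rate Datastore would accept. Passing a realistic value to withHintNumWorkers brings the combined starting rate closer to the intended limit while keeping the ramp-up in place.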