Tags: google-cloud-platform, cloud, persistence, google-cloud-dataflow, google-cloud-bigtable

Dataflow - State persistence database


We are considering using Beam/Dataflow for stateful processing:

  • Real-time aggregation of metrics on global windows (results emitted every 1 min)
  • Real-time aggregation over a high number of parallel sessions (> 1 million)

Example: get, for each of 1 million clients, the maximum price of an article bought since the client registered on the portal (see the sketch below).
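For illustration, here is a rough sketch of what we have in mind, in Beam's Java SDK. A `PCollection<Purchase> purchases` with `getClientId()` and `getPrice()` is an assumed input; strictly speaking this computes the max since pipeline start, so pre-registration history would need to be bootstrapped separately:

```java
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.Max;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Duration;

// Max purchase price per client: a global window whose accumulating pane
// re-fires every minute, combined with a per-key Max.
PCollection<KV<String, Double>> maxPricePerClient = purchases
    .apply("KeyByClient", MapElements
        .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.doubles()))
        .via(p -> KV.of(p.getClientId(), p.getPrice())))
    .apply("GlobalWindowEveryMinute", Window
        .<KV<String, Double>>into(new GlobalWindows())
        .triggering(Repeatedly.forever(AfterProcessingTime
            .pastFirstElementInPane()
            .plusDelayOf(Duration.standardMinutes(1))))
        .accumulatingFiredPanes()
        .withAllowedLateness(Duration.ZERO))
    .apply("MaxPricePerClient", Max.doublesPerKey());
```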

We would also like to access those calculated aggregates without interfering with the real-time job.

Design question: can this be covered by the current state back-end, Windmill/Persistent Disks [1], or would a database such as Bigtable be a better fit?

Thanks!

[1] Dataflow - State persistence?


Solution

  • It is actually possible to define Bigtable connectors in Dataflow to perform read and write operations, so the pipeline can persist its aggregates to Bigtable directly. Moreover, there is the projects.jobs.get method of the Dataflow API, which returns an instance of a job as a JSON response that also contains the "currentState" field. You could therefore build a sort of automation script that fetches this field's value and stores it in a Bigtable database; however, it is a rather complex solution and I am not sure it would be convenient. Both parts are sketched below.
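For the connector part, a minimal sketch (Beam Java SDK) that writes each per-client aggregate from the pipeline above to Bigtable via BigtableIO; the project, instance, table, and column-family names are placeholders:

```java
import java.util.Collections;

import com.google.bigtable.v2.Mutation;
import com.google.protobuf.ByteString;
import org.apache.beam.sdk.io.gcp.bigtable.BigtableIO;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptor;

// Convert each (clientId, maxPrice) pair into a single SetCell mutation
// and write it with the Beam Bigtable connector.
maxPricePerClient
    .apply("ToBigtableMutations", MapElements
        .into(new TypeDescriptor<KV<ByteString, Iterable<Mutation>>>() {})
        .via(kv -> KV.of(
            ByteString.copyFromUtf8(kv.getKey()),           // row key = client id
            Collections.singletonList(Mutation.newBuilder()
                .setSetCell(Mutation.SetCell.newBuilder()
                    .setFamilyName("aggregates")            // placeholder column family
                    .setColumnQualifier(ByteString.copyFromUtf8("max_price"))
                    .setValue(ByteString.copyFromUtf8(Double.toString(kv.getValue())))
                    .setTimestampMicros(-1))                // -1 = server-assigned time
                .build()))))
    .apply("WriteToBigtable", BigtableIO.write()
        .withProjectId("my-project")      // placeholder
        .withInstanceId("my-instance")    // placeholder
        .withTableId("client-aggregates") // placeholder
    );
```

And for the monitoring part, a sketch of such an automation script using the Dataflow and Cloud Bigtable Java clients, assuming application default credentials; all ids are placeholders:

```java
import com.google.api.client.googleapis.auth.oauth2.GoogleCredential;
import com.google.api.client.googleapis.javanet.GoogleNetHttpTransport;
import com.google.api.client.json.jackson2.JacksonFactory;
import com.google.api.services.dataflow.Dataflow;
import com.google.api.services.dataflow.DataflowScopes;
import com.google.api.services.dataflow.model.Job;
import com.google.cloud.bigtable.data.v2.BigtableDataClient;
import com.google.cloud.bigtable.data.v2.models.RowMutation;

public class JobStateRecorder {
  public static void main(String[] args) throws Exception {
    String projectId = "my-project"; // placeholder
    String jobId = "my-job-id";      // placeholder

    // Call projects.jobs.get; the returned Job carries the "currentState" field.
    Dataflow dataflow = new Dataflow.Builder(
            GoogleNetHttpTransport.newTrustedTransport(),
            JacksonFactory.getDefaultInstance(),
            GoogleCredential.getApplicationDefault()
                .createScoped(DataflowScopes.all()))
        .setApplicationName("job-state-recorder")
        .build();
    Job job = dataflow.projects().jobs().get(projectId, jobId).execute();
    String state = job.getCurrentState(); // e.g. "JOB_STATE_RUNNING"

    // Store the state in Bigtable, one row per job.
    try (BigtableDataClient bigtable =
            BigtableDataClient.create(projectId, "my-instance" /* placeholder */)) {
      bigtable.mutateRow(RowMutation.create("job-states", jobId)
          .setCell("state", "current_state", state));
    }
  }
}
```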