google-cloud-dataflow

Deploying Dataflow job that runs for X hours


We are deploying/triggering Dataflow streaming jobs through Airflow using a flex template. We want these streaming jobs to run for, say, 24 hours (or until a certain clock time), then stop/cancel on their own. Is there a Dataflow parameter (a pipeline setting, like max workers) that will do this?


Solution

  • I think there is no parameter or automatic approach to stop or drain a Dataflow job.

    You can do that with an Airflow DAG. For example, you can create a cron DAG in Airflow (running every 24 hours) that is responsible for stopping or draining the Dataflow job; there is a built-in operator to do that:

    from airflow.providers.google.cloud.operators.dataflow import DataflowStopJobOperator

    # Drains (by default) or cancels the Dataflow job(s) whose name matches the prefix
    stop_dataflow_job = DataflowStopJobOperator(
        task_id="stop-dataflow-job",
        location="europe-west3",
        job_name_prefix="start-template-job",
    )
    

    To stop one or more Dataflow pipelines you can use DataflowStopJobOperator. Streaming pipelines are drained by default; setting drain_pipeline to False will cancel them instead. Provide job_id to stop a specific job, or job_name_prefix to stop all jobs with the provided name prefix.
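
    As a minimal sketch (the DAG id, start date, and schedule below are assumptions, not values from the question), the operator can be wrapped in a daily cron DAG so the streaming job is drained roughly every 24 hours:

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.google.cloud.operators.dataflow import DataflowStopJobOperator

    # Hypothetical daily DAG that drains the streaming job(s) started by the flex template
    with DAG(
        dag_id="stop_dataflow_streaming_job",      # assumed DAG id
        start_date=datetime(2024, 1, 1),           # assumed start date
        schedule_interval="@daily",                # run every 24 hours
        catchup=False,
    ) as dag:
        stop_streaming_job = DataflowStopJobOperator(
            task_id="stop-dataflow-job",
            location="europe-west3",
            job_name_prefix="start-template-job",  # stops all jobs with this name prefix
            drain_pipeline=True,                   # default; set to False to cancel instead
        )

    Draining lets in-flight elements finish processing before the job stops, which is usually what you want for a streaming pipeline; set drain_pipeline=False if you need a hard cancel instead.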