Tags: java, kubernetes, pipeline, spring-cloud-dataflow

Spring Cloud Data Flow: Programmatic Orchestration of Tasks


Background

I have the Spring Cloud Data Flow server running in Kubernetes as a pod. I am able to launch tasks from the SCDF server UI dashboard. I am now looking to develop a more complicated, real-world task-pipeline use case.

Instead of using the SCDF UI dashboard, I want to launch a sequential list of tasks from a standard Java application. Consider the following task pipeline:

Task 1: Reads data from the database for the unique id received as a task argument and performs enrichment. The enriched record is written back to the database. One task instance is responsible for processing one unique id.

Task 2: Reads the enriched data written by Task 1 for the unique id received as a task argument and generates reports. One task instance is responsible for generating reports for one unique id.

It should be clear from the above that Task 1 and Task 2 are sequential steps. Assume the input database contains 50k unique ids. I want to develop an orchestrator Java program that launches Task 1 with a concurrency limit of 40 (i.e., only 40 pods can be running at any given time for Task 1; any request to launch more Task 1 pods should be made to wait). Only once all 50k unique ids have been processed by Task 1 instances should Task 2 pods be launched.
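To make the gating concrete, this is roughly the orchestration logic I have in mind (plain Java sketch; launchTaskAndWait and loadUniqueIds are placeholders for the actual launch mechanism and the id lookup):

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class Orchestrator {

        private static final int MAX_CONCURRENT_PODS = 40;

        public static void main(String[] args) throws InterruptedException {
            List<String> uniqueIds = loadUniqueIds(); // ~50k ids from the input database

            // Phase 1: enrichment. The fixed pool caps in-flight launches at 40;
            // submissions beyond that queue until a slot frees up.
            runPhase("task1", uniqueIds);

            // Phase 2 starts only after every Task 1 instance has finished.
            runPhase("task2", uniqueIds);
        }

        private static void runPhase(String taskName, List<String> ids) throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(MAX_CONCURRENT_PODS);
            for (String id : ids) {
                // Each job launches one task pod and blocks until it exits, so the
                // pool size equals the maximum number of concurrently running pods.
                pool.submit(() -> launchTaskAndWait(taskName, id));
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.DAYS); // barrier: phase ends when all ids are processed
        }

        private static void launchTaskAndWait(String taskName, String uniqueId) {
            // placeholder: launch one task execution for uniqueId and wait for completion
        }

        private static List<String> loadUniqueIds() {
            return List.of(); // placeholder: query the input database
        }
    }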

What I found so far

Going through the documentation, I found the Composed Task Runner. However, the examples show commands triggered from a shell/cmd window. I want to do something similar, but instead of opening a Data Flow shell I want to pass arguments to a Java program that launches the tasks internally. That would let me easily integrate my application with legacy code that already knows how to call Java (either by launching a Java program on demand that launches a set of tasks and waits for them to complete, or by calling a REST API).
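Digging around the client jars, something like the following looks possible (untested sketch, assuming the spring-cloud-dataflow-rest-client module is on the classpath and a task definition named task1 is already registered; the endpoint is a placeholder and the exact launch signature differs slightly between SCDF versions):

    import java.net.URI;
    import java.util.Collections;
    import java.util.List;

    import org.springframework.cloud.dataflow.rest.client.DataFlowTemplate;

    public class SingleTaskLauncher {

        public static void main(String[] args) {
            // DataFlowTemplate talks to the same REST API the dashboard and shell use
            DataFlowTemplate dataFlow = new DataFlowTemplate(URI.create("http://scdf-server:9393"));

            // Launch one execution of an already-registered task definition,
            // passing the unique id as a command-line argument
            dataFlow.taskOperations().launch(
                    "task1",                    // task definition name (placeholder)
                    Collections.emptyMap(),     // deployment properties
                    List.of("--uniqueId=42"));  // task arguments
        }
    }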

Question

  1. How do I programmatically launch tasks on demand with Spring Cloud Data Flow from Java instead of the Data Flow shell? (Is there a REST API for this? A simple Java program running on a standalone server would be fine too.)
  2. How do I programmatically build a sequential pipeline with an upper limit on the number of pods that can be launched per task, and with dependencies such that a task starts only once the previous task has finished processing all the inputs?

Solution

  • Please review the Java DSL support for Tasks.

    With this fluent-style API, you'd be able to compose the choreography of the tasks with sequential or parallel execution [example: .definition("a: timestamp && b:timestamp")]. A fuller sketch follows at the end of this answer.

    With this defined as Java code, you'd be able to build, launch, or schedule the launch of these directed graphs. We see many customers following this approach for E2E acceptance testing and deployment automation.


    Furthermore, you can extend the programmatic task definitions to continuous deployments as well.
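    For illustration, a minimal sketch of the Task DSL approach, assuming the spring-cloud-dataflow-rest-client dependency and two already-registered task apps here called enrich and report (the names, the server URI, and the exact package locations and method overloads are assumptions to verify against your SCDF version):

        import java.net.URI;
        import java.util.Collections;
        import java.util.List;

        import org.springframework.cloud.dataflow.rest.client.DataFlowTemplate;
        import org.springframework.cloud.dataflow.rest.client.dsl.task.Task;
        import org.springframework.cloud.dataflow.rest.client.dsl.task.TaskExecutionStatus;

        public class ComposedTaskPipeline {

            public static void main(String[] args) throws InterruptedException {
                // The DSL drives the SCDF server's REST API, same as the dashboard
                DataFlowTemplate dataFlow = new DataFlowTemplate(URI.create("http://scdf-server:9393"));

                // && sequences the steps: report runs only after enrich completes
                try (Task pipeline = Task.builder(dataFlow)
                        .name("enrich-then-report")
                        .definition("t1: enrich && t2: report")
                        .description("Sequential enrichment and reporting pipeline")
                        .build()) {

                    // Launch one execution, passing the unique id as a task argument
                    long executionId = pipeline.launch(
                            Collections.emptyMap(), List.of("--uniqueId=42"));

                    // Block until the composed task (and hence both steps) completes;
                    // a real orchestrator would also handle TaskExecutionStatus.ERROR
                    while (pipeline.executionStatus(executionId) != TaskExecutionStatus.COMPLETE) {
                        Thread.sleep(5_000);
                    }
                } // try-with-resources removes the task definition on close
            }
        }

    Note that a composed task gates step 2 behind step 1 for a single execution; fanning out over 50k unique ids with a 40-pod cap would still be handled by an outer launcher loop such as the one sketched in the question.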