Tags: apache-flink, etl, apache-beam, flink-batch

Use of Flink/Kubernetes to replace ETL jobs (on SSIS): one Flink cluster per job type, or create and destroy a Flink cluster per job execution?


I am trying to assess the feasibility of replacing the hundreds of feed-file ETL jobs built as SSIS packages with Apache Flink jobs (with Kubernetes as the underlying infrastructure). One recommendation I saw in an article is "to use one Flink cluster for one type of job".

Since I only have a handful of jobs per day of each job type, does this mean the best approach for me is to create a Flink cluster on the fly when executing a job and destroy it afterwards to free up resources? Is that the correct way to do it? I am setting up the Flink cluster without a job manager.

Any suggestions on best practices for using Flink for batch ETL activities?

Maybe the most important question: is Flink the right solution for this problem statement, or should I look more into Talend and other classic ETL tools?


Solution

  • Flink is well suited for running ETL workloads. The two deployment modes give you the following properties:

    Session cluster

    A session cluster allows you to run several jobs on the same set of resources (TaskExecutors). You start the session cluster before submitting any jobs.

    Benefits:

    • No additional cluster deployment time needed when submitting jobs => Faster job submissions
    • Better resource utilization if individual jobs don't need many resources
    • One place to control all your jobs

    Downsides:

    • No strict isolation between jobs
      • Failures caused by job A can cause job B to restart
      • Job A runs in the same JVM as job B and hence can influence it if static fields are used
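    As an illustration, starting a session cluster on native Kubernetes and submitting jobs to it could look like this sketch (the cluster id and jar path are placeholders you would choose yourself):

    ```shell
    # Start a long-running session cluster on Kubernetes.
    ./bin/kubernetes-session.sh \
        -Dkubernetes.cluster-id=etl-session-cluster

    # Submit as many jobs as you like against the same cluster;
    # no new cluster is deployed per submission.
    ./bin/flink run \
        --target kubernetes-session \
        -Dkubernetes.cluster-id=etl-session-cluster \
        ./my-etl-job.jar

    # Tear the session cluster down once it is no longer needed.
    kubectl delete deployment/etl-session-cluster
    ```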

    Per-job cluster

    A per-job cluster starts a dedicated Flink cluster for every job.

    Benefits

    • Strict job isolation
    • More predictable resource consumption since only a single job runs on the TaskExecutors

    Downsides

    • Cluster deployment time is part of the job submission time, resulting in longer submission times
    • Not a single cluster which controls all your jobs
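    On native Kubernetes, a dedicated cluster per job can be sketched with Flink's application mode, which couples the cluster's lifecycle to a single job (the image name, cluster id, and jar path below are placeholders):

    ```shell
    # Deploy a dedicated cluster whose lifecycle is bound to this one job.
    # The container image must contain the job jar at the given local:// path.
    ./bin/flink run-application \
        --target kubernetes-application \
        -Dkubernetes.cluster-id=etl-job-cluster \
        -Dkubernetes.container.image=my-registry/my-etl-image:latest \
        local:///opt/flink/usrlib/my-etl-job.jar

    # Resources are released when the job finishes; leftover Kubernetes
    # objects can be cleaned up explicitly if needed:
    kubectl delete deployment/etl-job-cluster
    ```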

    Recommendation

    So if you have many short-lived ETL jobs which require a fast response, then I would suggest using a session cluster, because you can avoid the cluster start-up time for every job. If the ETL jobs have a long runtime, this additional time carries little weight, and I would choose the per-job mode, which gives you more predictable runtime behaviour thanks to strict job isolation.