I am trying to assess the feasibility of replacing hundreds of feed-file ETL jobs, currently built as SSIS packages, with Apache Flink jobs (with Kubernetes as the underlying infrastructure). One recommendation I saw in an article is "to use one flink cluster for one type of job".
Since I have only a handful of jobs per day of each job type, does this mean the best approach for me is to create a Flink cluster on the fly when executing a job and destroy it afterwards to free up resources? Is that the correct way to do it? I am setting up the Flink cluster without a job manager.
Are there any suggestions on best practices for using Flink for batch ETL activities?
Maybe the most important question: is Flink the correct solution for this problem statement, or should I look more into Talend and other classic ETL tools?
Flink is well suited for running ETL workloads. The two deployment modes give you the following properties:
A session cluster lets you run several jobs on the same set of resources (TaskExecutors). You start the session cluster before submitting any jobs. The downside is that there is no strict isolation between jobs: a failure caused by job A can cause job B to restart, and job A runs in the same JVM as job B and hence can influence it if statics are used.
A per-job cluster starts a dedicated Flink cluster for every job, so each job runs on its own TaskExecutors.
So if you have many short-lived ETL jobs which require a fast response, then I would suggest using a session cluster, because you avoid the cluster start-up time for every job. If the ETL jobs have a long runtime, then this additional time carries little weight, and I would choose the per-job mode, which gives you more predictable runtime behaviour because of strict job isolation.
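To make the trade-off concrete, here is a minimal Python sketch of the rule of thumb above. It is not part of Flink: the names `JobProfile`, `startup_overhead` and `suggest_mode` are hypothetical, and the 60-second cluster start-up time and 10% overhead threshold are assumptions you would replace with numbers measured on your own Kubernetes setup.

```python
from dataclasses import dataclass


@dataclass
class JobProfile:
    """Hypothetical description of one ETL job type."""
    name: str
    runtime_s: float               # typical job runtime in seconds
    needs_isolation: bool = False  # True if the job must not share a JVM with other jobs


def startup_overhead(cluster_startup_s: float, runtime_s: float) -> float:
    """Fraction of end-to-end time spent waiting for a dedicated cluster to start."""
    return cluster_startup_s / (cluster_startup_s + runtime_s)


def suggest_mode(job: JobProfile,
                 cluster_startup_s: float = 60.0,  # assumed start-up time, measure your own
                 max_overhead: float = 0.10) -> str:  # assumed tolerance, tune to taste
    """Return "session" or "per-job" following the rule of thumb in the answer."""
    # Strict isolation is only available with a dedicated cluster.
    if job.needs_isolation:
        return "per-job"
    # Short jobs: start-up time dominates, so reuse a running session cluster.
    if startup_overhead(cluster_startup_s, job.runtime_s) > max_overhead:
        return "session"
    # Long jobs: start-up time carries little weight, so prefer isolation.
    return "per-job"


if __name__ == "__main__":
    for job in (JobProfile("small-feed", runtime_s=30),
                JobProfile("nightly-load", runtime_s=3600)):
        print(job.name, "->", suggest_mode(job))
```

With these assumed numbers, a 30-second feed job would be routed to a session cluster, while an hour-long load would get its own cluster.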