google-cloud-platform, google-cloud-data-fusion, cdap

GCP Data Fusion is too slow at executing pipelines


I understand that Data Fusion is a managed service built on CDAP, but the current 6.1.1 Enterprise edition is too slow compared to CDAP OSS (the one in Google Cloud Marketplace). It takes about 3 minutes to provision the Dataproc nodes (whatever the compute profile is), then about 1.5 minutes to start and reach the running state, and only then does data start flowing through the nodes. Is there any way to optimize this and bring it up to speed?


Solution

  • The CDAP OSS in Google Cloud Marketplace runs in memory and is suggested only for development, because its execution engine cannot scale.

    If you want to cut the Dataproc provisioning time, you can pre-provision a Dataproc cluster yourself and use the Remote Hadoop Provisioner compute profile to submit the job to it instead.
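
As a rough illustration of the pre-provisioning step, the sketch below builds (and optionally runs) a `gcloud dataproc clusters create` command from Python. The cluster name, region, worker count, and machine type are placeholder assumptions and do not come from the question; the helper names `build_create_cluster_command` and `create_cluster` are hypothetical. It only covers creating the long-lived cluster. The Remote Hadoop Provisioner profile itself still has to be set up in Data Fusion and pointed at that cluster.

```python
import shlex
import subprocess


def build_create_cluster_command(cluster_name, region, num_workers=2,
                                 machine_type="n1-standard-4"):
    """Return the gcloud argv for creating a long-lived Dataproc cluster.

    All argument values are illustrative assumptions; size the cluster
    for your own pipelines.
    """
    return [
        "gcloud", "dataproc", "clusters", "create", cluster_name,
        f"--region={region}",
        f"--num-workers={num_workers}",
        f"--master-machine-type={machine_type}",
        f"--worker-machine-type={machine_type}",
    ]


def create_cluster(cluster_name, region, dry_run=True, **kwargs):
    """Print the command (dry run) or execute it with the local gcloud CLI."""
    cmd = build_create_cluster_command(cluster_name, region, **kwargs)
    if dry_run:
        # Show the exact command without touching any cloud resources.
        print(shlex.join(cmd))
        return cmd
    # Requires an installed and authenticated gcloud CLI.
    subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Hypothetical name and region, used only for illustration.
    create_cluster("datafusion-static", "us-central1")
```

With a cluster like this already running, pipeline runs should skip the per-run provisioning wait described in the question, at the cost of paying for the cluster while it sits idle.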