dask, google-cloud-dataproc, dask-distributed

Using an existing Dataproc cluster to run Dask


I have a Dataproc cluster running on the Google Cloud Platform. I intend to pass this cluster to the Dask client instead of initializing a new dask-yarn cluster.

However, I am not able to use my Dataproc cluster directly:

from dask_yarn import YarnCluster
from dask.distributed import Client

# Instead of:
cluster = YarnCluster(environment='environment.tar.gz', worker_vcores=2, worker_memory="8GiB")
cluster.scale(10)
client = Client(cluster)

# Directly using my Dataproc cluster (pseudocode -- this is what I want to do):
client = Client(my_dataproc_cluster)

Solution

  • Dataproc creates a new Hadoop cluster; dask-yarn is for creating Dask clusters that run inside your Hadoop cluster (wherever that may be). To run properly it requires a correctly set up Python environment and configuration, just as any other tool on Hadoop would (Spark included).

    We don't have a Dataproc-specific guide, but the one for AWS's equivalent, EMR, is here: http://yarn.dask.org/en/latest/aws-emr.html

    For deploying on Dataproc you'd likely create an initialization action equivalent to the EMR bootstrap action: https://github.com/dask/dask-yarn/blob/master/deployment_resources/aws-emr/bootstrap-dask
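
    A Dataproc initialization action along those lines might look roughly like the sketch below, loosely mirroring the EMR bootstrap script linked above. This is an illustration only: the Miniconda download URL, install paths, environment name, and package pins are all assumptions, not part of any official Dataproc or dask-yarn guide.

    ```shell
    #!/usr/bin/env bash
    # Hypothetical Dataproc initialization action: install a conda environment
    # with dask and dask-yarn on each node, then package it for YARN containers.
    set -euxo pipefail

    # Install Miniconda (URL and /opt/conda prefix are assumptions)
    wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /tmp/miniconda.sh
    bash /tmp/miniconda.sh -b -p /opt/conda

    # Create an environment containing dask, dask-yarn, and conda-pack
    /opt/conda/bin/conda create -y -n dask -c conda-forge python=3.10 dask dask-yarn conda-pack

    # Package the environment into the environment.tar.gz that
    # YarnCluster(environment=...) expects to distribute to workers
    /opt/conda/envs/dask/bin/conda-pack -o /opt/environment.tar.gz
    ```

    With an environment like this in place on the cluster, you would then create the `YarnCluster` from the Dataproc master node (as in the question's first snippet) rather than passing the Dataproc cluster itself to `Client`.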