jupyter-notebook google-cloud-dataproc google-cloud-datalab

Why can't I create a Google DataProc cluster with both Jupyter and DataLab installed?


I want to create a Dataproc cluster with both Jupyter and Datalab installed (I understand they are very similar, but team members have different preferences). I can create a cluster with either one of them:

Cluster with Jupyter:

gcloud dataproc clusters create $DATAPROC_CLUSTER_NAME_JUPYTER \
--project $PROJECT \
--bucket $BUCKET \
--zone $ZONE \
--initialization-actions gs://dataproc-initialization-actions/connectors/connectors.sh,gs://dataproc-initialization-actions/jupyter/jupyter.sh \
--metadata gcs-connector-version=$GCS_CONNECTOR_VERSION \
--metadata bigquery-connector-version=$BQ_CONNECTOR_VERSION \
--metadata JUPYTER_PORT=$JUPYTER_PORT,JUPYTER_CONDA_PACKAGES=numpy:scipy:pandas:scikit-learn

Cluster with DataLab:

gcloud dataproc clusters create $DATAPROC_CLUSTER_NAME_DATALAB \
--project $PROJECT \
--bucket $BUCKET \
--zone $ZONE \
--master-boot-disk-size $MASTER_DISK_SIZE \
--worker-boot-disk-size $WORKER_DISK_SIZE \
--initialization-actions gs://dataproc-initialization-actions/connectors/connectors.sh,gs://dataproc-initialization-actions/datalab/datalab.sh \
--metadata gcs-connector-version=$GCS_CONNECTOR_VERSION \
--metadata bigquery-connector-version=$BQ_CONNECTOR_VERSION \
--scopes cloud-platform,bigquery

And both work well. However, when I try to create a cluster with both of them, it fails:

gcloud dataproc clusters create test \
--project $PROJECT \
--bucket $BUCKET \
--zone $ZONE \
--initialization-actions gs://dataproc-initialization-actions/connectors/connectors.sh,gs://dataproc-initialization-actions/datalab/datalab.sh,gs://dataproc-initialization-actions/jupyter/jupyter.sh \
--metadata gcs-connector-version=$GCS_CONNECTOR_VERSION \
--metadata bigquery-connector-version=$BQ_CONNECTOR_VERSION \
--metadata JUPYTER_PORT=$JUPYTER_PORT,JUPYTER_CONDA_PACKAGES=numpy:scipy:pandas:scikit-learn \
--scopes cloud-platform,bigquery

The error messages are:

ERROR: (gcloud.dataproc.clusters.create) Operation [projects/abc/regions/global/operations/d34943dc-5bda-386f-af91-db6e0516e2c5] failed: Multiple Errors:
 - Initialization action failed. Failed action 'gs://dataproc-initialization-actions/jupyter/jupyter.sh', see output in: gs://abc/google-cloud-dataproc-metainfo/266175ef-e595-4732-b351-335837a3f30e/test-m/dataproc-initialization-script-2_output
 - Initialization action failed. Failed action 'gs://dataproc-initialization-actions/jupyter/jupyter.sh', see output in: gs://abc/google-cloud-dataproc-metainfo/266175ef-e595-4732-b351-335837a3f30e/test-w-0/dataproc-initialization-script-2_output
 - Initialization action failed. Failed action 'gs://dataproc-initialization-actions/jupyter/jupyter.sh', see output in: gs://abc/google-cloud-dataproc-metainfo/266175ef-e595-4732-b351-335837a3f30e/test-w-1/dataproc-initialization-script-2_output.

The output file for test-m looks like the following:

++ /usr/share/google/get_metadata_value attributes/dataproc-role
+ readonly ROLE=Worker
+ ROLE=Worker
++ /usr/share/google/get_metadata_value attributes/INIT_ACTIONS_REPO
++ echo https://github.com/GoogleCloudPlatform/dataproc-initialization-actions.git
+ readonly INIT_ACTIONS_REPO=https://github.com/GoogleCloudPlatform/dataproc-initialization-actions.git
+ INIT_ACTIONS_REPO=https://github.com/GoogleCloudPlatform/dataproc-initialization-actions.git
++ /usr/share/google/get_metadata_value attributes/INIT_ACTIONS_BRANCH
++ echo master
+ readonly INIT_ACTIONS_BRANCH=master
+ INIT_ACTIONS_BRANCH=master
++ /usr/share/google/get_metadata_value attributes/JUPYTER_CONDA_CHANNELS
+ readonly JUPYTER_CONDA_CHANNELS=
+ JUPYTER_CONDA_CHANNELS=
++ /usr/share/google/get_metadata_value attributes/JUPYTER_CONDA_PACKAGES
+ readonly JUPYTER_CONDA_PACKAGES=numpy:scipy:pandas:scikit-learn
+ JUPYTER_CONDA_PACKAGES=numpy:scipy:pandas:scikit-learn
+ echo 'Cloning fresh dataproc-initialization-actions from repo https://github.com/GoogleCloudPlatform/dataproc-initialization-actions.git and branch master...'
Cloning fresh dataproc-initialization-actions from repo https://github.com/GoogleCloudPlatform/dataproc-initialization-actions.git and branch master...
+ git clone -b master --single-branch https://github.com/GoogleCloudPlatform/dataproc-initialization-actions.git
fatal: destination path 'dataproc-initialization-actions' already exists and is not an empty directory.

It looks like the git clone step fails because the destination directory already exists, which prevents the installation from succeeding. How can I solve this? Any suggestions are appreciated, thank you.
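
The last two lines of the log show the mechanism: git clone refuses to write into a directory that already exists and is not empty, presumably because an earlier init action left a checkout there. As a rough illustration only, a guard like the one below would make the clone step safe to run twice. The helper name clone_once is hypothetical and is not taken from the actual init actions.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of dataproc-initialization-actions:
# clone a repository only if the destination is not already a git checkout.
clone_once() {
  local url="$1" branch="$2" dest="$3"
  if [ -d "${dest}/.git" ]; then
    # A previous init action already cloned the repo; reuse it.
    echo "Reusing existing checkout in ${dest}"
  else
    git clone -b "${branch}" --single-branch "${url}" "${dest}"
  fi
}

# Example call, using the values that appear in the log:
# clone_once https://github.com/GoogleCloudPlatform/dataproc-initialization-actions.git master dataproc-initialization-actions
```

With a guard of this kind, the second init action would reuse the existing checkout instead of aborting.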


Solution

  • This appears to be a bug in the init actions: the repository cannot be git cloned twice on the same node, so the second action that tries to clone it fails. We will fix this.

    In the meantime, you can use the Jupyter optional component in place of the Jupyter init action, together with the Datalab init action.
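
A minimal sketch of that workaround, reusing the variable names from the question. It is wrapped in a hypothetical function, create_cluster_with_both, so nothing runs until you call it. Some assumptions to check against the Dataproc documentation for your gcloud release and image version: older releases may need gcloud beta for the optional-components flag, and some older image versions may also require the ANACONDA component alongside JUPYTER. The JUPYTER_* metadata keys from the question are left out because they are read by the jupyter.sh init action; whether the optional component honours them is not confirmed here.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: Jupyter comes from the optional component,
# Datalab from its init action, so jupyter.sh is not run at all.
create_cluster_with_both() {
  local name="${1:-test}"
  gcloud dataproc clusters create "${name}" \
    --project "${PROJECT}" \
    --bucket "${BUCKET}" \
    --zone "${ZONE}" \
    --optional-components JUPYTER \
    --initialization-actions gs://dataproc-initialization-actions/connectors/connectors.sh,gs://dataproc-initialization-actions/datalab/datalab.sh \
    --metadata "gcs-connector-version=${GCS_CONNECTOR_VERSION}" \
    --metadata "bigquery-connector-version=${BQ_CONNECTOR_VERSION}" \
    --scopes cloud-platform,bigquery
}

# Example call (requires gcloud and the variables above to be set):
# create_cluster_with_both test
```

Because only one init action clones the repository in this setup, the duplicate clone that caused the failure should not occur.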