We use to spin cluster with below configurations. It used to run fine till last week but now failing with error ERROR: Failed cleaning build dir for libcst Failed to build libcst ERROR: Could not build wheels for libcst which use PEP 517 and cannot be installed directly
Building wheels for collected packages: pynacl, libcst
Building wheel for pynacl (PEP 517): started
Building wheel for pynacl (PEP 517): still running...
Building wheel for pynacl (PEP 517): finished with status 'done'
Created wheel for pynacl: filename=PyNaCl-1.5.0-cp37-cp37m-linux_x86_64.whl size=201317 sha256=4e5897bc415a327f6b389b864940a8c1dde9448017a2ce4991517b30996acb71
Stored in directory: /root/.cache/pip/wheels/2f/01/7f/11d382bf954a093a55ed9581fd66c3b45b98769f292367b4d3
Building wheel for libcst (PEP 517): started
Building wheel for libcst (PEP 517): finished with status 'error'
ERROR: Command errored out with exit status 1:
command: /opt/conda/anaconda/bin/python /opt/conda/anaconda/lib/python3.7/site-packages/pip/_vendor/pep517/_in_process.py build_wheel /tmp/tmpon3bonqi
cwd: /tmp/pip-install-9ozf4fcp/libcst
Cluster configuration command :
gcloud dataproc clusters create cluster-test \
--enable-component-gateway \
--region us-east1 \
--zone us-east1-b \
--master-machine-type n1-highmem-32 \
--master-boot-disk-size 500 \
--num-workers 3 \
--worker-machine-type n1-highmem-16 \
--worker-boot-disk-size 500 \
--optional-components ANACONDA,JUPYTER,ZEPPELIN \
--image-version 1.5.54-ubuntu18 \
--tags <tag-name> \
--bucket '<cloud storage bucket>' \
--initialization-actions 'gs://goog-dataproc-initialization-actions-us-east1/connectors/connectors.sh','gs://goog-dataproc-initialization-actions-us-east1/python/pip-install.sh' \
--metadata='PIP_PACKAGES=wheel datalab xgboost==1.3.3 shap oyaml click apache-airflow apache-airflow-providers-google' \
--initialization-action-timeout 30m \
--metadata gcs-connector-version=2.1.1,bigquery-connector-version=1.1.1,spark-bigquery-connector-version=0.17.2 \
--project <project-name>
Things I tried: a) I tried to install wheel package explicitly as part of pip packages but the issue does not resolve
b) Gcloud Command with upgrade pip script:
gcloud dataproc clusters create cluster-test \
--enable-component-gateway \
--region us-east1 \
--zone us-east1-b \
--master-machine-type n1-highmem-32 \
--master-boot-disk-size 500 \
--num-workers 3 \
--worker-machine-type n1-highmem-16 \
--worker-boot-disk-size 500 \
--optional-components ANACONDA,JUPYTER,ZEPPELIN \
--image-version 1.5.54-ubuntu18 \
--tags <tag-name> \
--bucket '<cloud storage bucket>' \
--initialization-actions 'gs://goog-dataproc-initialization-actions-us-east1/connectors/connectors.sh','gs://<bucket-path>/upgrade-pip.sh','gs://goog-dataproc-initialization-actions-us-east1/python/pip-install.sh' \
--metadata='PIP_PACKAGES=wheel datalab xgboost==1.3.3 shap oyaml click apache-airflow apache-airflow-providers-google' \
--initialization-action-timeout 30m \
--metadata gcs-connector-version=2.1.1,bigquery-connector-version=1.1.1,spark-bigquery-connector-version=0.17.2 \
--project <project-name>
Seems you need to upgrade pip
, see this question.
But there can be multiple pip
s in a Dataproc cluster, you need to choose the right one.
For init actions, at cluster creation time, /opt/conda/default
is a symbolic link to either /opt/conda/miniconda3
or /opt/conda/anaconda
, depending on which Conda env you choose, the default is Miniconda3, but in your case it is Anaconda. So you can run either /opt/conda/default/bin/pip install --upgrade pip
or /opt/conda/anaconda/bin/pip install --upgrade pip
.
For custom images, at image creation time, you want to use the explicit full path, /opt/conda/anaconda/bin/pip install --upgrade pip
for Anaconda, or /opt/conda/miniconda3/bin/pip install --upgrade pip
for Miniconda3.
So, you can simply use /opt/conda/anaconda/bin/pip install --upgrade pip
for both init actions and custom images.