
Dataproc Cluster creation is failing with PIP error "Could not build wheels"

We used to spin up clusters with the configuration below. It ran fine until last week, but cluster creation now fails with this error:

  ERROR: Failed cleaning build dir for libcst
  Failed to build libcst
  ERROR: Could not build wheels for libcst which use PEP 517 and cannot be installed directly

Building wheels for collected packages: pynacl, libcst
  Building wheel for pynacl (PEP 517): started
  Building wheel for pynacl (PEP 517): still running...
  Building wheel for pynacl (PEP 517): finished with status 'done'
  Created wheel for pynacl: filename=PyNaCl-1.5.0-cp37-cp37m-linux_x86_64.whl size=201317 sha256=4e5897bc415a327f6b389b864940a8c1dde9448017a2ce4991517b30996acb71
  Stored in directory: /root/.cache/pip/wheels/2f/01/7f/11d382bf954a093a55ed9581fd66c3b45b98769f292367b4d3
  Building wheel for libcst (PEP 517): started
  Building wheel for libcst (PEP 517): finished with status 'error'
  ERROR: Command errored out with exit status 1:
   command: /opt/conda/anaconda/bin/python /opt/conda/anaconda/lib/python3.7/site-packages/pip/_vendor/pep517/ build_wheel /tmp/tmpon3bonqi
       cwd: /tmp/pip-install-9ozf4fcp/libcst

Cluster creation command:

gcloud dataproc clusters create cluster-test \
--enable-component-gateway \
--region us-east1 \
--zone us-east1-b \
--master-machine-type n1-highmem-32 \
--master-boot-disk-size 500 \
--num-workers 3 \
--worker-machine-type n1-highmem-16 \
--worker-boot-disk-size 500 \
--optional-components ANACONDA,JUPYTER,ZEPPELIN \
--image-version 1.5.54-ubuntu18 \
--tags <tag-name> \
--bucket '<cloud storage bucket>' \
--initialization-actions 'gs://goog-dataproc-initialization-actions-us-east1/connectors/','gs://goog-dataproc-initialization-actions-us-east1/python/' \
--metadata='PIP_PACKAGES=wheel datalab xgboost==1.3.3 shap oyaml click apache-airflow apache-airflow-providers-google' \
--initialization-action-timeout 30m \
--metadata gcs-connector-version=2.1.1,bigquery-connector-version=1.1.1,spark-bigquery-connector-version=0.17.2 \
--project <project-name>

Things I tried:

a) Installing the wheel package explicitly as part of PIP_PACKAGES, but that did not resolve the issue.

b) Running the gcloud command with an additional init action script that upgrades pip:

gcloud dataproc clusters create cluster-test \
--enable-component-gateway \
--region us-east1 \
--zone us-east1-b \
--master-machine-type n1-highmem-32 \
--master-boot-disk-size 500 \
--num-workers 3 \
--worker-machine-type n1-highmem-16 \
--worker-boot-disk-size 500 \
--optional-components ANACONDA,JUPYTER,ZEPPELIN \
--image-version 1.5.54-ubuntu18 \
--tags <tag-name> \
--bucket '<cloud storage bucket>' \
--initialization-actions 'gs://goog-dataproc-initialization-actions-us-east1/connectors/','gs://<bucket-path>/','gs://goog-dataproc-initialization-actions-us-east1/python/' \
--metadata='PIP_PACKAGES=wheel datalab xgboost==1.3.3 shap oyaml click apache-airflow apache-airflow-providers-google' \
--initialization-action-timeout 30m \
--metadata gcs-connector-version=2.1.1,bigquery-connector-version=1.1.1,spark-bigquery-connector-version=0.17.2 \
--project <project-name>


  • It seems you need to upgrade pip; see this question.

    But a Dataproc cluster can have multiple pip installations, so you need to choose the right one.

    1. For init actions, at cluster creation time, /opt/conda/default is a symbolic link to either /opt/conda/miniconda3 or /opt/conda/anaconda, depending on which Conda environment you choose. The default is Miniconda3, but in your case it is Anaconda. So you can run either /opt/conda/default/bin/pip install --upgrade pip or /opt/conda/anaconda/bin/pip install --upgrade pip.

    2. For custom images, at image creation time, use the explicit full path: /opt/conda/anaconda/bin/pip install --upgrade pip for Anaconda, or /opt/conda/miniconda3/bin/pip install --upgrade pip for Miniconda3.

    So in your case, you can simply use /opt/conda/anaconda/bin/pip install --upgrade pip for both init actions and custom images.
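The steps above could be wrapped in a small init action script along the following lines. This is only a sketch under assumptions: the file name upgrade-pip.sh, the CONDA_ROOT override (added purely so the script can be tried outside a cluster), and the function names are all hypothetical, and the only paths it relies on are the ones mentioned above. Whether upgrading pip actually resolves the libcst build failure is not confirmed here.

```shell
#!/bin/sh
# upgrade-pip.sh (hypothetical name): upgrade the Conda pip on a Dataproc node.
set -eu

# Assumption: CONDA_ROOT exists only to make the script testable off-cluster;
# on a Dataproc node the Conda installations live under /opt/conda.
CONDA_ROOT="${CONDA_ROOT:-/opt/conda}"

# Print the first executable Conda pip, preferring the "default" symlink,
# then the explicit Anaconda and Miniconda3 paths.
find_conda_pip() {
  for candidate in default anaconda miniconda3; do
    if [ -x "${CONDA_ROOT}/${candidate}/bin/pip" ]; then
      echo "${CONDA_ROOT}/${candidate}/bin/pip"
      return 0
    fi
  done
  return 1
}

# Upgrade pip with the pip that was found; skip quietly if none exists.
main() {
  if pip_bin="$(find_conda_pip)"; then
    "${pip_bin}" install --upgrade pip
  else
    echo "No Conda pip found under ${CONDA_ROOT}; skipping upgrade" >&2
  fi
}

main
```

A script like this would be staged in Cloud Storage and listed in --initialization-actions ahead of the python init action, as in attempt (b), so that pip is upgraded before PIP_PACKAGES are installed.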