Tags: pyspark, google-kubernetes-engine, google-cloud-dataproc

Dataproc on GKE: Python packages listed in properties not installed


I created a Dataproc cluster on a GKE cluster, with the required Python packages already listed in the cluster properties, following the documented examples. But when I submitted a job, it failed with a ModuleNotFoundError.

...
Waiting for job output...
 PYSPARK_PYTHON=/opt/conda/bin/python
JAVA_HOME=/usr/lib/jvm/temurin-8-jdk-amd64
SPARK_EXTRA_CLASSPATH=
Merging Spark configs
Skipping merging /opt/spark/conf/spark-defaults.conf, file does not exist.
Skipping merging /opt/spark/conf/log4j.properties, file does not exist.
Skipping merging /opt/spark/conf/spark-env.sh, file does not exist.
Skipping custom init script, file does not exist.
Running heartbeat loop
Traceback (most recent call last):
  File "/tmp/spark-d6516b57-0924-4ce2-9de8-a5c1116667b4/pkg.py", line 1, in <module>
    from google.cloud import secretmanager
ModuleNotFoundError: No module named 'google'

This is the gcloud command I used:

gcloud dataproc clusters gke create gke-dp --region=asia-southeast1 --spark-engine-version=3.1 \
--gke-cluster=gke-spark --gke-cluster-location=asia-southeast1-b --namespace=dataproc \
--pools='name=dp-default,roles=default,machineType=n2-standard-2,min=1,max=1' \
--pools='name=dp-workers,roles=spark-driver;spark-executor,machineType=n2-standard-4,min=1,max=4' \
--properties='^#^dataproc:pip.packages=google-cloud-secret-manager==2.15.0,numpy==1.24.1#spark:spark.jars=https://jdbc.postgresql.org/download/postgresql-42.5.1.jar' \
--properties="dataproc:dataproc.gke.agent.google-service-account=dataproc@de-project.iam.gserviceaccount.com" \
--properties="dataproc:dataproc.gke.spark.driver.google-service-account=dataproc@de-project.iam.gserviceaccount.com" \
--properties="dataproc:dataproc.gke.spark.executor.google-service-account=dataproc@de-project.iam.gserviceaccount.com"

Solution

  • This functionality is not supported by Dataproc on GKE: the dataproc:pip.packages cluster property is only honored by Dataproc on Compute Engine, so the packages listed there are never installed on a GKE-backed cluster (see the workaround sketch below).
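
A common workaround is to bake the packages into a custom Spark container image and point the job at it via the spark.kubernetes.container.image property. Below is a minimal sketch, not the original answer's method: the base image path/tag and the Artifact Registry URL are placeholders (check the Dataproc on GKE custom container image docs for the current base images), and the pip path matches the PYSPARK_PYTHON shown in the job output above.

# Dockerfile: extend a Dataproc-on-GKE Spark base image (placeholder path/tag)
FROM us-central1-docker.pkg.dev/cloud-dataproc/spark/dataproc_2.0:latest

USER root
# Install the packages the job imports into the image's conda Python
RUN /opt/conda/bin/pip install google-cloud-secret-manager==2.15.0 numpy==1.24.1
USER spark

Build and push the image, then reference it when submitting the job (the repository name spark-images/pyspark-custom is hypothetical):

docker build -t asia-southeast1-docker.pkg.dev/de-project/spark-images/pyspark-custom:latest .
docker push asia-southeast1-docker.pkg.dev/de-project/spark-images/pyspark-custom:latest

gcloud dataproc jobs submit pyspark pkg.py --cluster=gke-dp --region=asia-southeast1 \
--properties="spark.kubernetes.container.image=asia-southeast1-docker.pkg.dev/de-project/spark-images/pyspark-custom:latest"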