I’m trying to start a PySpark job on GCP Dataproc Serverless with a custom container, but when I try to access the main file in my custom image, I get this exception:
Exception in thread "main" org.apache.spark.SparkException: Failed to get main class in JAR with error '/var/dataproc/tmp/srvls-batch-10bc1778-798f-4477-b0ea-e8440770784f (Is a directory)'. Please specify one with --class.
To reproduce this exception, I made a simple hello-world script and a basic image. The image is hosted on Google Container Registry, and here are its contents:
# Base image
FROM centos:7
# Copy the Python source code
COPY helloword.py helloword.py
# Useful tools
RUN yum install -y curl wget procps
# Versions
ENV TINI_VERSION=v0.19.0
# Install tini
RUN curl -fL "https://github.com/krallin/tini/releases/download/${TINI_VERSION}/tini" -o /usr/bin/tini \
&& chmod +x /usr/bin/tini
# Create the 'spark' group/user.
RUN groupadd -g 1099 spark
RUN useradd -u 1099 -g 1099 -d /home/spark -m spark
USER spark
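The helloword.py it copies is just a stub; the actual contents aren't shown, so this is a sketch of what a minimal version could look like:

```python
# helloword.py -- minimal placeholder workload (an assumption;
# the real script's contents are not shown in the question).
def main():
    print("Hello world from Dataproc Serverless!")

if __name__ == "__main__":
    main()
```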
and this is the command line associated to the job:
gcloud dataproc batches submit pyspark --batch name_batch file://helloword.py \
--project name_project \
--region europe-west9 \
--version 1.1.19 \
--container-image "eu.gcr.io/name_project/image-test" \
--subnet default \
--service-account service_account
Do you know how I can access my helloword.py?
Thanks in advance.
You are seeing this error because the file://helloword.py path is resolved relative to the Spark working directory, but you copied this file into the Docker working directory (/ by default) in your container.
To fix this issue, reference the file with an absolute path: file:///helloword.py
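With the absolute path, the submit command from the question becomes (same placeholder names as above, only the file URI changes):

gcloud dataproc batches submit pyspark --batch name_batch file:///helloword.py \
--project name_project \
--region europe-west9 \
--version 1.1.19 \
--container-image "eu.gcr.io/name_project/image-test" \
--subnet default \
--service-account service_account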