pyspark, amazon-emr

How to use my own jar as dependency in AWS EMR


I'm submitting a Spark driver program to an EMR cluster, and it needs to use a jar I uploaded, so my code looked like this:

boto3.client("emr-containers").start_job_run(
    name=job_name,
    virtualClusterId=self.virtual_cluster_id,
    releaseLabel="emr-6.11.0-latest",
    executionRoleArn=role,
    jobDriver={
        "sparkSubmitJobDriver": {
            "entryPoint": entry_point,
            "entryPointArguments": entry_point_args,
            "sparkSubmitParameters": '--driver-class-path s3://my_bucket/mysql-connector-j-8.0.32.jar --jars s3://my_bucket/mysql-connector-j-8.0.32.jar --conf spark.kubernetes.driver.podTemplateFile=my_file.yaml --conf spark.kubernetes.executor.podTemplateFile=my_file.yaml',
        }
    },
)

But this causes EMR to throw an exception:

Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found

According to the docs and some other posts on Stack Overflow, I suspect this is because my --driver-class-path and --jars arguments overwrote EMR's default classpath settings. So what arguments should I pass to EMR so I can use my own jar while avoiding the problem above?


Solution

  • On EMR Serverless you can add jars in sparkSubmitParameters like this: --conf spark.jars=s3://my-bucket/multiple-jars/*. I suspect it should be similar for EMR on EKS (link). Unlike --driver-class-path, which replaces the driver's classpath entirely (dropping the EMR-provided jars that contain EmrFileSystem), spark.jars adds your jar on top of the defaults.
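
For completeness, here is how the original call might look with that change applied. This is a minimal sketch, not a verified fix: it reuses the bucket, entry point, and pod template names from the question, and assumes job_name, virtual_cluster_id, role, entry_point, and entry_point_args are defined as in the original code.

import boto3

client = boto3.client("emr-containers")

response = client.start_job_run(
    name=job_name,
    virtualClusterId=virtual_cluster_id,
    releaseLabel="emr-6.11.0-latest",
    executionRoleArn=role,
    jobDriver={
        "sparkSubmitJobDriver": {
            "entryPoint": entry_point,
            "entryPointArguments": entry_point_args,
            # Use spark.jars instead of --driver-class-path/--jars so the
            # custom jar is appended to the classpath without clobbering
            # the EMR defaults (which provide EmrFileSystem).
            "sparkSubmitParameters": (
                "--conf spark.jars=s3://my_bucket/mysql-connector-j-8.0.32.jar "
                "--conf spark.kubernetes.driver.podTemplateFile=my_file.yaml "
                "--conf spark.kubernetes.executor.podTemplateFile=my_file.yaml"
            ),
        }
    },
)

Alternatively, on EMR on EKS such Spark properties can also be set via the configurationOverrides/applicationConfiguration argument of start_job_run (classification "spark-defaults"), but the spark.jars conf above is the most direct translation of the original call.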