google-cloud-dataproc

Google Dataproc templates: GCS to Bigtable dependencies


The GCS To Bigtable template (https://github.com/GoogleCloudPlatform/dataproc-templates/tree/main/python/dataproc_templates/gcs#gcs-to-bigtable) lists hbase-spark-protocol-shaded.jar and hbase-spark.jar as dependencies.

I receive the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling o90.save.
: java.lang.NoClassDefFoundError: scala/Product$class
    at org.apache.hadoop.hbase.spark.HBaseRelation.<init>(DefaultSource.scala:97)
    at org.apache.hadoop.hbase.spark.DefaultSource.createRelation(DefaultSource.scala:79)

...

Caused by: java.lang.ClassNotFoundException: scala.Product$class
    at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:476)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:589)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
    ... 43 more

If I add the following to my Python job:

import os
print(f"/usr/lib/spark/external/: {os.listdir('/usr/lib/spark/external/')}", flush=True)

It shows

/usr/lib/spark/external/: ['spark-token-provider-kafka-0-10.jar', 'spark-avro_2.12-3.3.1.jar', 'spark-sql-kafka-0-10_2.12-3.3.1.jar', 'spark-sql-kafka-0-10.jar', 'spark-streaming-kafka-0-10-assembly.jar', 'spark-avro.jar', 'spark-streaming-kafka-0-10-assembly_2.12-3.3.1.jar', 'spark-token-provider-kafka-0-10_2.12-3.3.1.jar']

Neither hbase-spark-protocol-shaded.jar nor hbase-spark.jar appears in that listing.
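To confirm whether the connector jars are anywhere on the image rather than just absent from one directory, a small scan can help. This is a sketch, not part of the template: the helper name and the candidate directory list are assumptions, so adjust them for your runtime image.

```python
import os

def list_matching_jars(dirs, keyword):
    """Return sorted paths of .jar files whose names contain `keyword`,
    searching only the directories that actually exist."""
    hits = []
    for d in dirs:
        if not os.path.isdir(d):
            continue
        for name in os.listdir(d):
            if keyword in name and name.endswith(".jar"):
                hits.append(os.path.join(d, name))
    return sorted(hits)

if __name__ == "__main__":
    # Candidate locations are guesses; adjust for your runtime image.
    dirs = ["/usr/lib/spark/external/", "/usr/lib/spark/jars/"]
    jars = list_matching_jars(dirs, "hbase")
    print("HBase connector jars found:", jars or "NONE", flush=True)
```

Running this at the top of the job makes the failure mode obvious in the driver log before the save() call blows up.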

Where do I retrieve the correct versions from?

I'm using the --version=1.1 flag in the gcloud command for the GCP Spark runtime (ref: https://cloud.google.com/dataproc-serverless/docs/concepts/versions/spark-runtime-versions#spark_runtime_version_11)


Solution

  • hbase-spark.jar and hbase-spark-protocol-shaded.jar will be included in runtime 1.1 starting with the next subminor version (1.1.4), which should roll out by Feb 24.

    In the meantime, as a workaround, you can use runtime 1.0.
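    The workaround amounts to pinning the runtime version on the submit command. A minimal sketch, in which the script name, region, and bucket are placeholders rather than values from the template:

    ```shell
    # Sketch only: pin the Dataproc Serverless runtime to 1.0, where the
    # HBase connector jars ship in the image. Placeholder values throughout.
    gcloud dataproc batches submit pyspark main.py \
        --region=us-central1 \
        --version=1.0 \
        --deps-bucket=gs://my-staging-bucket
    ```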