Tags: pyspark, google-bigquery, spark-bigquery-connector

PySpark with BigQuery connector. Failed to find data source: bigquery


I need to write a PySpark result to BigQuery. Following https://github.com/GoogleCloudDataproc/spark-bigquery-connector , I use the following:


    from pyspark.sql import SparkSession

    # Pull the BigQuery connector from Maven Central at session startup.
    spark = SparkSession.builder \
        .config("spark.jars.packages",
                "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.27.1") \
        .getOrCreate()
    spark_context = spark.sparkContext

Every attempt to save,


    data.toDF(schema) \
        .write.format("bigquery") \
        .option("table", "tmp-project:tmpdataset.tmp_table") \
        .save()

leads to an exception:

*java.lang.ClassNotFoundException: Failed to find data source: bigquery. Please find packages at http://spark.apache.org/third-party-projects.html*

I also tried the following, with the same result:

  1. referencing the jar directly on GCS: 'gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar' (see the sketch after this list);
  2. downloading 'spark-bigquery-latest_2.12.jar' and pointing to the local path; according to the logs, the file definitely exists;
  3. passing the jar as an argument, e.g. pyspark --jars gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.27.1.jar, but this option is not available to me right now;
  4. changing the format from "bigquery" to "com.google.cloud.spark.bigquery".
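
For reference, attempt 1 looked roughly like this. This is only a sketch: spark.jars is the standard Spark property for shipping extra jars, and the GCS path is the one from the connector's README:

    from pyspark.sql import SparkSession

    # Attempt 1 (sketch): reference the connector jar on GCS directly
    # instead of resolving it from Maven via spark.jars.packages.
    spark = (
        SparkSession.builder
        .config("spark.jars", "gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar")
        .getOrCreate()
    )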

PySpark version: 3.0.0, Scala version: 2.12.10.

The code below returns an empty result:

    [spark_context._jsc.sc().jars().apply(i)
     for i in range(spark_context._jsc.sc().jars().length())]

UPD: Upgrading Spark to 3.1.1 and using a locally downloaded jar with chmod 777 on it changed the behavior, but has not solved the issue yet:

    spark_context._jsc.sc().listJars()

returns

    Vector(spark://<...>.svc:<port>/jars/spark-bigquery-with-dependencies_2.13-0.27.1.jar)

and

    spark_context._jsc.sc().jars()

returns

    ArrayBuffer(./<...>/spark-bigquery-with-dependencies_2.13-0.27.1.jar)

Also, a new log entry appeared:

    SparkContext: Added JAR ./<...>/spark-bigquery-with-dependencies_2.13-0.27.1.jar at spark://<...>.svc:<port>/jars/spark-bigquery-with-dependencies_2.13-0.27.1.jar with timestamp <timestamp>
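
At this point Spark clearly ships the jar, so the next question is whether the driver JVM can actually load the connector's provider class. A minimal diagnostic sketch (it assumes the provider class is com.google.cloud.spark.bigquery.BigQueryRelationProvider, matching the format alias tried above):

    # Sketch: ask the driver JVM to load the connector's provider class.
    # If this raises, the jar was shipped but is not on the driver's classpath.
    jvm = spark_context._jvm
    try:
        jvm.java.lang.Class.forName(
            "com.google.cloud.spark.bigquery.BigQueryRelationProvider")
        print("connector class is visible to the driver")
    except Exception as e:
        print("connector class NOT visible to the driver:", e)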

Solution

  • The solution was to copy all jar files to /opt/spark/jars/. Just keeping the jar locally on the Docker container and loading it at runtime did not help, but moving it to exactly this path did. If anyone else bumps into this issue, you can also try tuning the SPARK_CLASSPATH environment variable; it might help as well. In my case, SPARK_CLASSPATH referenced /opt/spark/jars/.
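
A quick sanity check for both conditions (a sketch; the path and the environment variable are the ones mentioned above):

    import glob
    import os

    # Confirm the connector jar actually sits in /opt/spark/jars/ ...
    print(glob.glob("/opt/spark/jars/spark-bigquery-with-dependencies_*.jar"))
    # ... and that SPARK_CLASSPATH points at that directory.
    print(os.environ.get("SPARK_CLASSPATH"))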