I need to write PySpark's result to BigQuery. Following https://github.com/GoogleCloudDataproc/spark-bigquery-connector , I use the following:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .config("spark.jars.packages",
            "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.27.1") \
    .getOrCreate()
spark_context = spark.sparkContext
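As far as I understand, spark.jars.packages only takes effect if it is set before the JVM gateway starts, so the builder config may be ignored when a SparkContext already exists. A sketch of a workaround I have seen, setting PYSPARK_SUBMIT_ARGS before creating the session (the env-var approach is my assumption here, not something from the connector docs):
import os
from pyspark.sql import SparkSession

# Assumption: no SparkContext exists yet. PYSPARK_SUBMIT_ARGS is read when
# the JVM gateway is launched, so it must be set before getOrCreate().
# The trailing "pyspark-shell" token is required by PySpark's launcher.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.27.1 "
    "pyspark-shell"
)

spark = SparkSession.builder.getOrCreate()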
Each attempt to save,
data.toDF(schema) \
    .write.format("bigquery") \
    .option("table", "tmp-project:tmpdataset.tmp_table") \
    .save()
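For reference, the connector README also shows the table reference passed directly to save() and a GCS staging bucket for the indirect write method; a sketch along those lines ("some-bucket" is a placeholder, and this does not by itself address the ClassNotFoundException below):
# Sketch following the connector README: indirect writes stage the data in
# GCS first, so a temporaryGcsBucket is required. "some-bucket" is a
# placeholder, not a bucket from my setup.
data.toDF(schema) \
    .write.format("bigquery") \
    .option("temporaryGcsBucket", "some-bucket") \
    .save("tmp-project.tmpdataset.tmp_table")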
leads to an exception:
*java.lang.ClassNotFoundException: Failed to find data source: bigquery. Please find packages at http://spark.apache.org/third-party-projects.html*
Also tried the following, with the same result:
pyspark --jars gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.27.1.jar
(this jar location is not available right now). PySpark version 3.0.0, Scala version 2.12.10.
The code below returns an empty result:
[spark_context._jsc.sc().jars().apply(i) for i in range(spark_context._jsc.sc().jars().length())]
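As a rough sanity check of what the driver's classloader can actually see, the provider class can be looked up through py4j; a sketch, assuming the class name com.google.cloud.spark.bigquery.BigQueryRelationProvider from the connector sources:
# If this raises ClassNotFoundException as well, the jar is not visible to
# the classloader used by the py4j gateway. Note that jars added via addJar
# at runtime are shipped to executors but are not necessarily added to this
# loader, so this check is only indicative.
spark_context._jvm.java.lang.Class.forName(
    "com.google.cloud.spark.bigquery.BigQueryRelationProvider"
)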
UPD
Upgraded Spark to 3.1.1; using a locally downloaded jar with chmod 777 applied to it
changed the behavior, but has not solved the issue yet:
spark_context._jsc.sc().listJars()
returns Vector(spark://<...>.svc:<port>/jars/spark-bigquery-with-dependencies_2.13-0.27.1.jar)
spark_context._jsc.sc().jars()
returns ArrayBuffer(./<...>/spark-bigquery-with-dependencies_2.13-0.27.1.jar)
Also, a new log line appeared:
SparkContext: Added JAR ./<...>/spark-bigquery-with-dependencies_2.13-0.27.1.jar at spark://<...>.svc:<port>/jars/spark-bigquery-with-dependencies_2.13-0.27.1.jar with timestamp <timestamp>
The solution was to copy all jar files to /opt/spark/jars/. Just keeping the file locally in the Docker container and loading it at runtime did not help, but moving it to exactly this path did. If anyone else bumps into this issue, you can also try tuning the SPARK_CLASSPATH env variable; it might help as well. SPARK_CLASSPATH pointed to /opt/spark/jars/ in my case.
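A quick way to verify that final state from Python (the paths are assumptions from my setup):
import glob
import os

# Assumption: Spark home is /opt/spark inside the container, as described above.
print(glob.glob("/opt/spark/jars/spark-bigquery-with-dependencies_*.jar"))
print(os.environ.get("SPARK_CLASSPATH"))  # referenced /opt/spark/jars/ in my case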