I am trying to write a JSON object into a BigQuery table field from a GCP Dataproc Serverless batch job.
Could you please let me know if there is a way to write into a BQ table JSON field with Spark?
Dataproc version: 2.1 (Spark 3.4, Java 17, Scala 2.13)
BigQuery connector jar: gs://spark-lib/bigquery/spark-bigquery-latest.jar
BQ Table Structure:
Column    Type
message   JSON
# Read the JSON files and wrap each parsed record under a single "message" key.
rdd1 = spark.read.json("gs://sample_data_x/sample_json/*.json").rdd
rdd2 = rdd1.map(lambda x: {"message": x})
df = rdd2.toDF()
df.printSchema()
df.show()

df.write.format('bigquery') \
    .option('table', '') \
    .option('project', '') \
    .option('temporaryGcsBucket', '') \
    .mode('overwrite') \
    .save()
Error Message:
Caused by: com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryException: Unsupported field type: JSON
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.Job.reload(Job.java:419)
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.Job.waitFor(Job.java:252)
at com.google.cloud.bigquery.connector.common.BigQueryClient.createAndWaitFor(BigQueryClient.java:333)
at com.google.cloud.bigquery.connector.common.BigQueryClient.createAndWaitFor(BigQueryClient.java:323)
at com.google.cloud.bigquery.connector.common.BigQueryClient.loadDataIntoTable(BigQueryClient.java:564)
at com.google.cloud.spark.bigquery.write.BigQueryWriteHelper.loadDataToBigQuery(BigQueryWriteHelper.java:134)
at com.google.cloud.spark.bigquery.write.BigQueryWriteHelper.writeDataFrameToBigQuery(BigQueryWriteHelper.java:107)
... 44 more
When creating the Dataproc Serverless batch job, set the following property: dataproc.sparkBqConnector.version=0.32.2
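For example, when submitting with gcloud (the script name and region below are placeholders):

gcloud dataproc batches submit pyspark my_job.py \
    --region=us-central1 \
    --properties=dataproc.sparkBqConnector.version=0.32.2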
Support for JSON fields was added in connector version 0.30.0, but the Spark runtime ships with connector version 0.28.1. The runtime does not upgrade the connector automatically due to breaking changes (related to Numeric/BigNumeric handling) in later versions. This is why you need to specify the connector version manually.
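With the newer connector in place, the write itself can stay close to what you already have. Here is a minimal sketch, assuming the input files are newline-delimited JSON (one object per line) and that marking a STRING field with sqlType=JSON metadata, as described in the connector documentation, makes it load into a JSON column; the table and bucket names are hypothetical:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("json-to-bq").getOrCreate()

# Read each JSON document as raw text (assumes one JSON object per line),
# producing a single STRING column that holds the serialized payload.
df = spark.read.text("gs://sample_data_x/sample_json/*.json") \
    .withColumnRenamed("value", "message")

# Tag the string column with the sqlType=JSON metadata the connector
# recognizes, so it is written as a BigQuery JSON field rather than STRING.
json_schema = StructType([
    StructField("message", StringType(), True, metadata={"sqlType": "JSON"})
])
df = spark.createDataFrame(df.rdd, json_schema)

# Hypothetical table and bucket names.
df.write.format("bigquery") \
    .option("table", "my_project.my_dataset.my_table") \
    .option("temporaryGcsBucket", "my-temp-bucket") \
    .mode("overwrite") \
    .save()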