apache-spark, pyspark, google-bigquery, dataproc

Not able to write into a BigQuery JSON field with PySpark


I am trying to write a JSON object into a BigQuery table field of type JSON, using a GCP Dataproc batch job.

Could you please let me know if there is a way to write into a BigQuery JSON field with Spark?

Dataproc Version = 2.1 (Spark 3.4, Java 17, Scala 2.13)

Using the BQ Jar = gs://spark-lib/bigquery/spark-bigquery-latest.jar

BQ table structure:

Column  | Type
message | JSON

rdd1 = spark.read.json("gs://sample_data_x/sample_json/*.json").rdd
rdd2 = rdd1.map(lambda x: {"message": x})  # wrap each parsed record under a single "message" column
df = rdd2.toDF()
df.printSchema()
df.show()

df.write.format('bigquery') \
    .option('table', '') \
    .option('project', '') \
    .option('temporaryGcsBucket', '') \
    .mode('overwrite') \
    .save()

Error Message:

   Caused by: com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryException: Unsupported field type: JSON
        at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.Job.reload(Job.java:419)
        at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.Job.waitFor(Job.java:252)
        at com.google.cloud.bigquery.connector.common.BigQueryClient.createAndWaitFor(BigQueryClient.java:333)
        at com.google.cloud.bigquery.connector.common.BigQueryClient.createAndWaitFor(BigQueryClient.java:323)
        at com.google.cloud.bigquery.connector.common.BigQueryClient.loadDataIntoTable(BigQueryClient.java:564)
        at com.google.cloud.spark.bigquery.write.BigQueryWriteHelper.loadDataToBigQuery(BigQueryWriteHelper.java:134)
        at com.google.cloud.spark.bigquery.write.BigQueryWriteHelper.writeDataFrameToBigQuery(BigQueryWriteHelper.java:107)
        ... 44 more

Solution

  • When creating the Dataproc Serverless batch job, add the following property: dataproc.sparkBqConnector.version=0.32.2.

    Support for JSON fields was added in connector version 0.30.0, but the Dataproc 2.1 runtime ships with connector version 0.28.1. The runtime does not upgrade the connector automatically because later versions introduced breaking changes (related to Numeric/BigNumeric handling), which is why you need to pin the connector version manually.
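If you submit the batch with the gcloud CLI, the property can be passed with --properties=dataproc.sparkBqConnector.version=0.32.2 on gcloud dataproc batches submit pyspark. Below is a minimal sketch of doing the same through the google-cloud-dataproc client library; the project, region, batch id, and main script URI are placeholders, not values from the original question.

# Sketch: create a Dataproc Serverless batch with the BigQuery connector
# version pinned via a runtime property. Project, region, bucket and script
# names below are placeholders.
from google.cloud import dataproc_v1

region = "us-central1"  # placeholder region
client = dataproc_v1.BatchControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

batch = dataproc_v1.Batch(
    pyspark_batch=dataproc_v1.PySparkBatch(
        main_python_file_uri="gs://my-bucket/jobs/write_json_to_bq.py",  # placeholder script
    ),
    runtime_config=dataproc_v1.RuntimeConfig(
        version="2.1",  # Dataproc Serverless runtime from the question
        properties={"dataproc.sparkBqConnector.version": "0.32.2"},  # pin the BQ connector
    ),
)

operation = client.create_batch(
    parent=f"projects/my-project/locations/{region}",  # placeholder project
    batch=batch,
    batch_id="write-json-to-bq",  # placeholder batch id
)
operation.result()  # block until the operation completes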