Tags: apache-spark, pyspark, azure-databricks, apache-hudi

Writing Hudi table into Azure DataLake Gen2


I need to create a Hudi table from a PySpark DataFrame using Azure Databricks notebook and save it into Azure DataLake Gen2. Here's my approach:

spark.sparkContext.setSystemProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
spark.sparkContext.setSystemProperty("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")

tableName = "my_hudi_table"
basePath = <<path_to_save_data>>

# Generate 10 sample insert records with Hudi's QuickstartUtils (via the JVM gateway)
dataGen = sc._jvm.org.apache.hudi.QuickstartUtils.DataGenerator()
inserts = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateInserts(10))
df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))

hudi_options = {
    'hoodie.table.name': tableName,
    'hoodie.datasource.write.recordkey.field': 'uuid',
    'hoodie.datasource.write.partitionpath.field': 'partitionpath',
    'hoodie.datasource.write.table.name': tableName,
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.precombine.field': 'ts',
    'hoodie.upsert.shuffle.parallelism': 2,
    'hoodie.insert.shuffle.parallelism': 2
}

This creates a PySpark DataFrame, but when I try to save it as a Hudi table, I get the following error:

df.write.format("hudi").options(**hudi_options).mode("overwrite").save(basePath)

org.apache.hudi.exception.HoodieException: hoodie only support org.apache.spark.serializer.KryoSerializer as spark.serializer

However, running the following command suggests that the Kryo serializer is already set:

print(spark.sparkContext.getConf().get("spark.serializer"))

org.apache.spark.serializer.KryoSerializer

Note that I haven't done:

spark.conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

Because I get the error:

AnalysisException: Cannot modify the value of a Spark config: spark.serializer

The cluster I'm working on uses Databricks Runtime 9.1 LTS (Apache Spark 3.1.2, Scala 2.12), and I have installed the jar file "hudi_spark3_1_2_bundle_2_12_0_10_1.jar". I don't think it is an issue with the DataLake itself, since I can save the data in other formats (Parquet, CSV). Am I missing something?

Any help would be appreciated. Thanks in advance.


Solution

  • Cannot modify the value of a Spark config

    In a Databricks notebook, these configs cannot be updated from code; attempting to do so throws this error. You have to set the Hudi-related Spark configs at the cluster level, before the Spark session is created (see the sketch after the reference below).

    More generally, the SparkSession is a singleton and you cannot modify this kind of config at runtime; the session (in practice, the cluster) has to be stopped and started again for the new values to be taken into account.

    BTW, spark.sparkContext.setSystemProperty is not meant for setting those configs; it sets a JVM system property rather than updating the running Spark configuration.

    Reference: https://kb.databricks.com/en_US/scala/cannot-modify-spark-serializer
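    As a rough sketch of that cluster-level setup (the exact UI path, e.g. cluster edit page > Advanced options > Spark config, may differ between Databricks versions), you would add key/value pairs along these lines to the cluster's Spark config and then restart the cluster before re-running the notebook:

    spark.serializer org.apache.spark.serializer.KryoSerializer
    spark.sql.extensions org.apache.spark.sql.hudi.HoodieSparkSessionExtension

    With those in place, the df.write.format("hudi") call from the question should no longer fail the serializer check.

    For comparison, outside Databricks (e.g. a standalone PySpark job where you control session creation yourself), the same idea would look roughly like this, assuming the Hudi bundle jar is already on the classpath; the app name is just illustrative:

    from pyspark.sql import SparkSession

    # The serializer and SQL extension must be set before the session is created;
    # once a SparkSession exists, spark.serializer can no longer be changed.
    spark = (
        SparkSession.builder
        .appName("hudi-adls-write")  # illustrative name
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
        .getOrCreate()
    )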