Tags: apache-spark, pyspark, hive, aws-glue, apache-hudi

Spark-Hudi: Save as table to Glue/Hive catalog


Scenario: Store a Hudi Spark DataFrame using the saveAsTable (DataFrameWriter) method, such that a Hudi-backed table with the org.apache.hudi.hadoop.HoodieParquetInputFormat input format is generated automatically.

Currently, saveAsTable works fine with normal (non-Hudi) tables, generating the default input format. I want to automate Hudi table creation with the supported input format, either through some overridden version of saveAsTable or some other way, while staying within Spark.
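
For contrast, a minimal sketch of the non-Hudi path that already works; the session setup, source path, and table name here are purely illustrative:

    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
    val df = spark.read.parquet("s3://bucket/input/") // illustrative source path

    // Registers the table in the catalog with the default Parquet input
    // format, not org.apache.hudi.hadoop.HoodieParquetInputFormat.
    df.write.mode(SaveMode.Overwrite).saveAsTable("default.my_table")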


Solution

  • Hudi DOES NOT support saveAsTable yet.

    You have two options to sync Hudi tables with a Hive metastore:

    Sync inside Spark

    import org.apache.hudi.DataSourceWriteOptions
    import org.apache.hudi.hive.MultiPartKeysValueExtractor
    import org.apache.spark.sql.SaveMode

    val hudiOptions = Map[String, String](
    ...
      // Hive sync options: enable the sync and point Hudi at HiveServer2
      DataSourceWriteOptions.HIVE_URL_OPT_KEY -> "jdbc:hive2://<thrift server host>:<port>",
      DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY -> "true",
      DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY -> "<the database>",
      DataSourceWriteOptions.HIVE_TABLE_OPT_KEY -> "<the table>",
      DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY -> "<the partition field>",
      DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY -> classOf[MultiPartKeysValueExtractor].getName,
    ...
    )
    // Write the DataFrame as a Hudi dataset;
    // it will appear in Hive (similar to saveAsTable)
    test_parquet_partition.write
      .format("org.apache.hudi")
      .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
      .options(hudiOptions)
      .mode(SaveMode.Overwrite)
      .save(hudiTablePath)
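
    Since the question targets the Glue catalog: on EMR, the Hive sync above can land in the AWS Glue Data Catalog when the cluster uses Glue as its Hive metastore. A minimal sketch of that session setup, assuming an EMR cluster (the factory class is the one EMR documents for Glue):

    import org.apache.spark.sql.SparkSession

    // Assumption: EMR cluster; this factory class backs the Hive
    // metastore with the AWS Glue Data Catalog.
    val spark = SparkSession.builder()
      .config("hive.metastore.client.factory.class",
        "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
      .enableHiveSupport()
      .getOrCreate()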
    

    Sync outside Spark

    Use the bash script from the Hudi documentation after running your Hudi Spark transformations:

    cd hudi-hive

    ./run_sync_tool.sh --jdbc-url jdbc:hive2://hiveserver:10000 --user hive --pass hive --partitioned-by partition --base-path <basePath> --database default --table <tableName>
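
    Either way, once the sync completes, the table resolves through the Hive/Glue catalog. A quick sanity check from Spark, with default.my_hudi_table standing in as a hypothetical name for the synced <tableName>:

    // Hypothetical table name standing in for the synced <tableName>.
    // DESCRIBE FORMATTED should report HoodieParquetInputFormat as the input format.
    spark.sql("DESCRIBE FORMATTED default.my_hudi_table").show(100, truncate = false)
    spark.sql("SELECT * FROM default.my_hudi_table LIMIT 10").show()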