Tags: apache-spark, pyspark, hive, aws-glue, apache-hudi

Spark-Hudi: Save as table to Glue/Hive catalog


Scenario: Store a Hudi Spark DataFrame using the saveAsTable (DataFrameWriter) method, such that a Hudi-backed table with the org.apache.hudi.hadoop.HoodieParquetInputFormat input format is generated automatically.

Currently, saveAsTable works fine with normal (non-Hudi) tables, generating the default input format. I want to automate Hudi table creation with the supported input format, either through some overridden version of saveAsTable or some other way, while staying within Spark.
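
For contrast, a minimal sketch of the non-Hudi path that already works; the session setup, source path, and table name here are purely illustrative:

    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
    val df = spark.read.parquet("s3://bucket/input/") // illustrative source path

    // Registers the table in the catalog with the default Parquet input
    // format, not org.apache.hudi.hadoop.HoodieParquetInputFormat.
    df.write.mode(SaveMode.Overwrite).saveAsTable("default.my_table")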


Solution

  • Hudi DOES NOT support saveAsTable yet.

    You have two options to sync Hudi tables with a Hive metastore:

    Sync inside Spark

    import org.apache.hudi.DataSourceWriteOptions
    import org.apache.hudi.hive.MultiPartKeysValueExtractor
    import org.apache.spark.sql.SaveMode

    val hudiOptions = Map[String, String](
    ...
      // Hive sync options: enable the sync and point Hudi at HiveServer2
      DataSourceWriteOptions.HIVE_URL_OPT_KEY -> "jdbc:hive2://<thrift server host>:<port>",
      DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY -> "true",
      DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY -> "<the database>",
      DataSourceWriteOptions.HIVE_TABLE_OPT_KEY -> "<the table>",
      DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY -> "<the partition field>",
      DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY -> classOf[MultiPartKeysValueExtractor].getName,
    ...
    )
    // Write the DataFrame as a Hudi dataset;
    // it will appear in Hive (similar to saveAsTable)
    test_parquet_partition.write
      .format("org.apache.hudi")
      .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
      .options(hudiOptions)
      .mode(SaveMode.Overwrite)
      .save(hudiTablePath)
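
    Since the question targets the Glue catalog: on EMR, the Hive sync above can land in the AWS Glue Data Catalog when the cluster uses Glue as its Hive metastore. A minimal sketch of that session setup, assuming an EMR cluster (the factory class is the one EMR documents for Glue):

    import org.apache.spark.sql.SparkSession

    // Assumption: EMR cluster; this factory class backs the Hive
    // metastore with the AWS Glue Data Catalog.
    val spark = SparkSession.builder()
      .config("hive.metastore.client.factory.class",
        "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
      .enableHiveSupport()
      .getOrCreate()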
    

    Sync outside Spark

    Use the bash script from the Hudi documentation after running your Hudi Spark transformations:

    cd hudi-hive

    ./run_sync_tool.sh --jdbc-url jdbc:hive2://hiveserver:10000 --user hive --pass hive --partitioned-by partition --base-path <basePath> --database default --table <tableName>
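
    Either way, once the sync completes, the table resolves through the Hive/Glue catalog. A quick sanity check from Spark, with default.my_hudi_table standing in as a hypothetical name for the synced <tableName>:

    // Hypothetical table name standing in for the synced <tableName>.
    // DESCRIBE FORMATTED should report HoodieParquetInputFormat as the input format.
    spark.sql("DESCRIBE FORMATTED default.my_hudi_table").show(100, truncate = false)
    spark.sql("SELECT * FROM default.my_hudi_table LIMIT 10").show()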