Scenario:
Store a Hudi Spark DataFrame using the saveAsTable (DataFrameWriter)
method, such that a Hudi-backed table with the org.apache.hudi.hadoop.HoodieParquetInputFormat
input format and schema is generated automatically.
Currently, saveAsTable
works fine with a normal (non-Hudi) table, which generates the default input format.
I want to automate the Hudi table creation with the supported input format, either with some overridden version of saveAsTable
or some other way, staying within Spark.
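For reference, a minimal sketch of the non-Hudi path that already works, assuming a Hive-enabled SparkSession and a hypothetical DataFrame `df` and table name `my_db.my_table`:

```scala
import org.apache.spark.sql.SaveMode

// Plain Parquet table: saveAsTable registers it in the metastore
// with the default Parquet input format (names here are hypothetical)
df.write
  .format("parquet")
  .mode(SaveMode.Overwrite)
  .saveAsTable("my_db.my_table")
```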
Hudi DOES NOT support saveAsTable yet.

You have two options to sync Hudi tables with a Hive metastore:

1. Enable Hive sync in the write options, so the Hudi datasource registers the table in the metastore during the write:
```scala
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.hive.MultiPartKeysValueExtractor
import org.apache.spark.sql.SaveMode

val hudiOptions = Map[String, String](
  // ... other Hudi write options ...
  DataSourceWriteOptions.HIVE_URL_OPT_KEY -> "jdbc:hive2://<thrift server host>:<port>",
  DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY -> "true",
  DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY -> "<the database>",
  DataSourceWriteOptions.HIVE_TABLE_OPT_KEY -> "<the table>",
  DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY -> "<the partition field>",
  DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY -> classOf[MultiPartKeysValueExtractor].getName
  // ...
)

// Write the DataFrame as a Hudi dataset;
// with Hive sync enabled it will appear in Hive (similar to saveAsTable)
test_parquet_partition.write
  .format("org.apache.hudi")
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
  .options(hudiOptions)
  .mode(SaveMode.Overwrite)
  .save(hudiTablePath)
```
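Once the write completes, the synced table can be queried like any other Hive table. A minimal verification sketch, assuming a Hive-enabled SparkSession named `spark` and the hypothetical name `my_db.my_hudi_table` standing in for the HIVE_DATABASE/HIVE_TABLE values above:

```scala
// Query the hive-synced Hudi table through Spark SQL
// (my_db.my_hudi_table is a hypothetical placeholder)
spark.sql("SELECT * FROM my_db.my_hudi_table LIMIT 10").show()
```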
2. Run the standalone sync tool (a bash script) after your Hudi Spark transformations, as described in the Hudi documentation:
```bash
cd hudi-hive
./run_sync_tool.sh --jdbc-url jdbc:hive2://hiveserver:10000 --user hive --pass hive --partitioned-by partition --base-path <basePath> --database default --table <tableName>
```
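Either way, for a copy-on-write table the Hive table is registered with org.apache.hudi.hadoop.HoodieParquetInputFormat as its input format, which is exactly what the scenario asks for.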