apache-spark, hive, hive-metastore, apache-spark-2.3

Can't save table to Hive metastore, HDP 3.0


I can't save a table to the Hive database anymore using the metastore. I can see the tables in Spark using spark.sql, but I can't see the same tables in the Hive database. I tried the following, but it doesn't store the table in Hive. How can I configure the Hive metastore? The Spark version is 2.3.1.

If you want more details, please comment.

%spark
import org.apache.spark.sql.SparkSession
val spark = (SparkSession
        .builder
        .appName("interfacing spark sql to hive metastore without configuration file")
        .config("hive.metastore.uris", "thrift://xxxxxx.xxx:9083") // replace with your hivemetastore service's thrift url
        .enableHiveSupport() // don't forget to enable hive support
        .getOrCreate())

spark.conf.get("spark.sql.warehouse.dir")// Output: res2: String = /apps/spark/warehouse
spark.conf.get("hive.metastore.warehouse.dir")// NotSuchElement Exception
spark.conf.get("spark.hadoop.hive.metastore.uris")// NotSuchElement Exception

var df = (spark
        .read
        .format("parquet")
        .load(dataPath))

df.createOrReplaceTempView("my_temp_table");
spark.sql("drop table if exists my_table");
spark.sql("create table my_table using hive as select * from my_temp_table");
spark.sql("show tables").show(false)// I see my_table in default database

Update after @catpaws' answer: on HDP 3.0 and later, Hive and Spark use independent catalogues.

Save a table to the Spark catalogue:

df.createOrReplaceTempView("my_temp_table");
spark.sql("create table my_table as select * from my_temp_table");

VS

Save a table to the Hive catalogue:

val hive = com.hortonworks.spark.sql.hive.llap.HiveWarehouseBuilder.session(spark).build()

hive.createTable("newTable")
  .ifNotExists()
  .column("ws_sold_time_sk", "bigint")
  ...// x 200 columns
  .column("ws_ship_date_sk", "bigint")
  .create()

df.write.format(HIVE_WAREHOUSE_CONNECTOR)
  .option("table", "newTable")
  .save()

As you can see, this makes the Hive Warehouse Connector very impractical for dataframes with hundreds of columns. Is there any way to save large dataframes to Hive?


Solution

  • As @catpaws said, Spark and Hive use independent catalogues. To save a dataframe with many columns using the Hive Warehouse Connector, you can use my function (the setup it assumes is sketched after the code):

    save_table_hwc(df1, "default", "table_test1")
    
    def save_table_hwc(df: DataFrame, database: String, tableName: String): Unit = {
        // assumes `hive` (a HiveWarehouseSession) and HIVE_WAREHOUSE_CONNECTOR are in scope
        hive.setDatabase(database)
        hive.dropTable(tableName, true, false)
        var tableBuilder = hive.createTable(tableName)
        for (field <- df.schema.fields) {
            // keep only letters and digits in the column name
            val name = field.name.replaceAll("[^\\p{L}\\p{Nd}]+", "")
            tableBuilder = tableBuilder.column(name, field.dataType.sql)
        }
        tableBuilder.create()
        df.write.format(HIVE_WAREHOUSE_CONNECTOR).option("table", tableName).save()
    }