Tags: apache-spark, hadoop, hive, hive-metastore, hadoop3

Spark and Hive in Hadoop 3: Difference between metastore.catalog.default and spark.sql.catalogImplementation


I'm working on a Hadoop cluster (HDP) with Hadoop 3. Spark and Hive are also installed.

Since the Spark and Hive catalogs are separated, it is sometimes a bit confusing to know how and where to save data in a Spark application.

I know that the property spark.sql.catalogImplementation can be set to either in-memory (to use a Spark session-based catalog) or hive (to use the Hive catalog for persistent metadata storage, although that metadata is still kept separate from the Hive databases and tables).

I'm wondering what the property metastore.catalog.default does. When I set this to hive I can see my Hive tables, but since the tables are stored in the /warehouse/tablespace/managed/hive directory in HDFS, my user has no access to this directory (because hive is of course the owner).

So why should I set metastore.catalog.default = hive if I can't access the tables from Spark? Does it have something to do with Hortonworks' Hive Warehouse Connector?

Thank you for your help.


Solution

  • Catalog implementations

    There are two catalog implementations:

    • in-memory to create in-memory tables that are only available within the Spark session,
    • hive to create persistent tables using an external Hive Metastore (see the sketch after this list).

    More details here.
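
    A minimal Scala sketch of how the implementation is typically chosen when the SparkSession is built (the application name is just a placeholder):

    ```scala
    import org.apache.spark.sql.SparkSession

    // Hive catalog: persistent tables backed by an external Hive Metastore.
    // enableHiveSupport() is equivalent to setting
    // spark.sql.catalogImplementation=hive before the session is created.
    val spark = SparkSession.builder()
      .appName("catalog-example") // placeholder name
      .enableHiveSupport()
      .getOrCreate()

    // For the in-memory catalog (the default), omit enableHiveSupport()
    // or pass --conf spark.sql.catalogImplementation=in-memory on submit.
    // The active implementation can be checked at runtime:
    println(spark.conf.get("spark.sql.catalogImplementation"))
    ```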

    Metastore catalog

    Multiple catalogs can coexist in the same Hive Metastore. For example, HDP versions from 3.1.0 to 3.1.4 use a separate catalog for Spark tables and for Hive tables.
    You may want to set metastore.catalog.default=hive to read Hive external tables using the Spark API, as sketched below. The table location in HDFS must be accessible to the user running the Spark application.
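
    For illustration only (not taken from the HDP documentation): one way to pass the property to a Spark application is through Spark's Hadoop configuration prefix; the database and table names below are placeholders.

    ```scala
    import org.apache.spark.sql.SparkSession

    // Equivalent on the command line:
    //   spark-submit --conf spark.hadoop.metastore.catalog.default=hive ...
    val spark = SparkSession.builder()
      .appName("read-hive-tables")    // placeholder name
      .enableHiveSupport()            // use the Hive catalog implementation
      // Ask the Metastore client for the "hive" catalog instead of the
      // separate "spark" catalog that HDP 3.x uses for Spark tables.
      .config("spark.hadoop.metastore.catalog.default", "hive")
      .getOrCreate()

    // This only succeeds if the table's HDFS location is readable by the
    // user running the application (e.g. an external table outside
    // /warehouse/tablespace/managed/hive).
    spark.sql("SELECT * FROM some_db.some_external_table LIMIT 10").show()
    ```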

    HDP 3.1.4 documentation

    You can find information on access patterns by Hive table type, read/write features, and security requirements in the following links: