The ingestion of a Hive database is very slow. To reduce the ingestion time, I'm ingesting into a temp database, then running `hdfs dfs -rmdir /path/to/original_db` followed by `hdfs dfs -mv /path/of/temp_db /path/to/original_db`. Now I need to recreate the tables.
I plan to do a `DROP DATABASE temp_db CASCADE` and then recreate the tables that Spark created, by editing the statement generated by `SHOW CREATE TABLE temp_db.one_table`. I'd have to do this for each table, replacing the temp_db location with the original location.
Is there a better way to do it? Maybe I can just rename the DB and set a new location for the tables.
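As far as I know, Hive cannot rename a database, so "set a new location" would mean repointing every table individually. A rough sketch of that idea (the table name is illustrative, and this assumes non-partitioned tables):

for table_name in ['one_table']:  # illustrative; loop over the real table list
    spark.sql(
        "ALTER TABLE temp_db.{t} SET LOCATION '/path/to/original_db/{t}'".format(t=table_name)
    )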
No answers here, so here is my own solution.

The approach I sketched above worked fine. With it, even queries that started during the data update kept working; YARN takes care of any problems.
Here are the commands I used when importing a new version of /path/mydb. I'm using external Hive tables.

First, everything is ingested into a landing database, mydb_landing. The tables are created with this pyspark command:

df.write.option("path", table_data_path) \
    .format("orc") \
    .mode("overwrite") \
    .saveAsTable(".".join(['mydb_landing', table_name]))
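For context, a self-contained version of that ingestion step could look like the sketch below; the tables list and the extract_table helper are hypothetical placeholders for however you actually read the source data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Landing database that will later be renamed into place
spark.sql("CREATE DATABASE IF NOT EXISTS mydb_landing LOCATION '/path/mydb_landing'")

tables = ['one_table', 'other_table']  # hypothetical table list
for table_name in tables:
    df = extract_table(table_name)  # hypothetical: however the source is read
    table_data_path = '/path/mydb_landing/{}'.format(table_name)
    # Write each table as external ORC data under the landing location
    df.write.option("path", table_data_path) \
        .format("orc") \
        .mode("overwrite") \
        .saveAsTable(".".join(['mydb_landing', table_name]))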
Then I capture the CREATE TABLE statement of every landing table (base_dest_landing is 'mydb_landing' and tables is the list of table names):

create_stmts = [
    spark.sql(
        'show create table {schema}.{table_name}'.format(schema=base_dest_landing, table_name=table_name)
    ).first().createtab_stmt
    for table_name in tables
]
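Each createtab_stmt mentions the landing database twice, in the table identifier and in the LOCATION clause, which is why a single string replace later is enough to retarget it. Illustratively (the exact SHOW CREATE TABLE output varies by Spark and Hive version):

print(create_stmts[0])
# Roughly, something like:
#   CREATE EXTERNAL TABLE `mydb_landing`.`one_table` (...)
#   STORED AS ORC
#   LOCATION 'hdfs:///path/mydb_landing/one_table'
print(create_stmts[0].replace('_landing', ''))
# Same statement, now pointing at `mydb`.`one_table` and /path/mydb/one_table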
Next I drop the current final database (base_dest is 'mydb'). Since the tables are external, DROP DATABASE ... CASCADE removes only the metadata, so the data files have to be deleted explicitly as well:

spark.sql('DROP DATABASE IF EXISTS {} CASCADE'.format(base_dest))

hdfs dfs -rm -r -f -skipTrash "/path/mydb"
Then I move mydb_landing to mydb. It is a very quick operation, basically a rename: `hdfs dfs -mv '/path/mydb_landing' '/path/mydb'`. After that I recreate the database at the new location: `CREATE DATABASE mydb LOCATION '/path/mydb'`.
Then I recreate the tables, pointing them at the moved data:

for stmt in create_stmts:
    # '_landing' appears both in the database name and in the LOCATION path
    create_stmt = stmt.replace('_landing', '')
    spark.sql(create_stmt)
Finally, I drop the landing database:

spark.sql('DROP DATABASE IF EXISTS mydb_landing CASCADE')
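Put together, the whole swap can be driven from one PySpark job. This is only a sketch under the same assumptions as above (external ORC tables already written to mydb_landing, a known tables list), with the HDFS steps issued through subprocess instead of a shell:

import subprocess

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

base_dest = 'mydb'
base_dest_landing = 'mydb_landing'
tables = ['one_table']  # hypothetical table list

# Capture the create statements from the landing database
create_stmts = [
    spark.sql('show create table {}.{}'.format(base_dest_landing, t)).first().createtab_stmt
    for t in tables
]

# Drop the old database, then its (external) data files
spark.sql('DROP DATABASE IF EXISTS {} CASCADE'.format(base_dest))
subprocess.run(['hdfs', 'dfs', '-rm', '-r', '-f', '-skipTrash', '/path/mydb'], check=True)

# Rename the landing directory into place
subprocess.run(['hdfs', 'dfs', '-mv', '/path/mydb_landing', '/path/mydb'], check=True)

# Recreate the database and its tables at the new location, then drop the landing metadata
spark.sql("CREATE DATABASE {} LOCATION '/path/mydb'".format(base_dest))
for stmt in create_stmts:
    spark.sql(stmt.replace('_landing', ''))
spark.sql('DROP DATABASE IF EXISTS {} CASCADE'.format(base_dest_landing))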
This Impala blog post has some nice ideas using Hive views.