Tags: hadoop, hive, hdfs

Hive: how to move a database and recreate tables to minimize downtime?


The ingestion of a Hive database is very slow. To reduce the ingestion time, I ingest into a temp database, then I run `hdfs dfs -rmdir /path/to/original_db` and `hdfs dfs -mv /path/of/temp_db /path/to/original_db`. Now I need to recreate the tables.

I plan to do a `DROP DATABASE temp_db CASCADE`, then recreate the tables created by Spark by adapting the command generated by `SHOW CREATE TABLE temp_db.one_table`. I'll need to do this for each table, replacing the temp_db location with the original location.

Is there a better way to do it? Maybe I can just rename the DB and set a new location for the tables.


Solution

  • No answer here, so here is my solution.

    The solution I outlined above worked fine. Using it, even queries that started during the data update keep working; YARN takes care of any problems.

    Here are the commands I used when importing a new version of /path/mydb. I'm using external Hive tables.

    1. Import the new data into mydb_landing. The tables are created with this PySpark command:
            df.write.option("path", table_data_path).format("orc").mode("overwrite").saveAsTable(".".join(['mydb_landing', table_name]))
    2. I already have all the table names, so I collect every CREATE statement:
            create_stmts = [
                spark.sql(
                    'show create table {schema}.{table_name}'.format(schema=base_dest_landing, table_name=table_name)
                    ).first().createtab_stmt
                for table_name in tables
                ]
    
    3. Drop the target database in Hive: spark.sql('DROP DATABASE IF EXISTS {} CASCADE'.format(base_dest))
    4. Since it contains only external tables, the drop removes metadata only, so I now delete the data from HDFS: hdfs dfs -rm -r -f -skipTrash "/path/mydb"
    5. Move mydb_landing to mydb. This is a very quick operation, basically a rename: `hdfs dfs -mv '/path/mydb_landing' '/path/mydb'`
    6. Recreate the database pointing to the "new" location: CREATE DATABASE mydb LOCATION '/path/mydb'
    7. Recreate the tables. This guarantees that I pick up any schema update:
       for stmt in create_stmts:
               create_stmt = stmt.replace('_landing', '')
               spark.sql(create_stmt)
    
    8. Delete the landing database from Hive: spark.sql('DROP DATABASE IF EXISTS mydb_landing CASCADE')
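
    The steps above can be sketched as a single PySpark helper. This is a minimal sketch, not my exact production code: it assumes a live `SparkSession` named `spark`, the mydb / mydb_landing names used above, and that the `hdfs` CLI is on the path; `swap_databases` and `rewrite_create_stmt` are hypothetical helper names.

    ```python
    # Sketch of the landing -> target database swap described above.
    # Assumes a live SparkSession `spark` and external Hive tables only,
    # so dropping databases touches metadata, not data.
    import subprocess

    LANDING_DB = "mydb_landing"
    TARGET_DB = "mydb"
    LANDING_PATH = "/path/mydb_landing"
    TARGET_PATH = "/path/mydb"

    def rewrite_create_stmt(stmt: str) -> str:
        """Point a SHOW CREATE TABLE statement at the target database
        and path by dropping the _landing suffix everywhere."""
        return stmt.replace("_landing", "")

    def swap_databases(spark, tables):
        # 1-2. Capture the CREATE statements from the landing database.
        create_stmts = [
            spark.sql(f"SHOW CREATE TABLE {LANDING_DB}.{t}").first().createtab_stmt
            for t in tables
        ]
        # 3. Drop the old target database (metadata only: external tables).
        spark.sql(f"DROP DATABASE IF EXISTS {TARGET_DB} CASCADE")
        # 4-5. Delete the old data, then rename the landing directory.
        subprocess.run(
            ["hdfs", "dfs", "-rm", "-r", "-f", "-skipTrash", TARGET_PATH],
            check=True)
        subprocess.run(
            ["hdfs", "dfs", "-mv", LANDING_PATH, TARGET_PATH],
            check=True)
        # 6-7. Recreate the database and tables at the "new" location.
        spark.sql(f"CREATE DATABASE {TARGET_DB} LOCATION '{TARGET_PATH}'")
        for stmt in create_stmts:
            spark.sql(rewrite_create_stmt(stmt))
        # 8. Drop the landing database's metadata.
        spark.sql(f"DROP DATABASE IF EXISTS {LANDING_DB} CASCADE")
    ```

    The rename in steps 4-5 is what keeps downtime low: the only window where the data is missing is between the `-rm` and the `-mv`, both of which are fast metadata operations on HDFS.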

    This Impala blog post has some nice ideas using Hive views.