I'm developing several Hive-based ETLs, and have seriously violated the DRY (Don't Repeat Yourself) principle.
How can I gain more composability / reusability in my Hive SQL scripts?
Basically I'm looking for an INCLUDE
or IMPORT
command.
Any tips? For folks that write a ton of HiveQL scripts, how do you not repeat yourself?
Thanks in advance.
You just need to have your reusable hive script into small .HQL
files. Create shell script for ordered execution of the .HQL
files.
All you need to make sure that the current executing script do have its pre-requisite completed. i.e. it should have expected table/schema.
There are two things, which would help you.
1) the hive -f abcd.hql
command : Use a series of this command in the shell script for ordered execution of the scripts. e.g.
hive -f specific.hql
wait $!
hive -f reusable1.hql
wait $!
hive -f specific2.hql
wait $!
hive -f specific3.hql
wait $!
hive -f reusable2.hql
You might note the wait
commands, basically to tell the shell to wait for last command to complete. This is very important when output table of the last script is input to the next one.
2) Use hiveconf
to customize table, schema (or even where conditions or column names) at runtime. e.g. create table '${hiveconf:schema.name}.my_table ...