Tags: hadoop, hive, partitioning, hadoop-archive

Querying data from har archives - Apache Hive


I am using Hadoop and facing the dreaded problem of large numbers of small files. I need to be able to create HAR archives out of existing Hive partitions and still query them. However, Hive apparently supports archiving partitions only in managed tables, not in external tables, which is pretty sad. As a workaround, I am trying to archive the files inside a partition's directory manually, using Hadoop's archive tool. I now need to configure Hive so that it can query the data stored in these archives along with the unarchived data stored in other partition directories. Please note that we use only external tables.

The namespace for accessing the files in the created partition HAR corresponds to the HDFS path of the partition directory. For example, a file in HDFS at:

hdfs:///user/user1/data/db1/tab1/ds=2016_01_01/f1.txt

can be accessed after archiving as:

har:///user/user1/data/db1/tab1/ds=2016_01_01.har/f1.txt
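To make the mapping concrete, here is a minimal Python sketch of the path translation described above. The function name `har_uri_for` is hypothetical, and the sketch assumes the archive is named after the partition directory with a `.har` suffix and sits next to it, as in the example; it only handles `hdfs:///` URIs on the default filesystem.

```python
import posixpath


def har_uri_for(hdfs_uri):
    """Map a file in a partition directory to its URI inside <partition>.har.

    Assumption: the archive sits next to the partition directory and is
    named "<partition dir>.har", as in the example paths above.
    """
    if not hdfs_uri.startswith("hdfs:///"):
        raise ValueError("expected an hdfs:/// URI on the default filesystem")
    # Drop the scheme but keep the leading slash of the absolute path.
    path = hdfs_uri[len("hdfs://"):]
    partition_dir, file_name = posixpath.split(path)
    return "har://" + partition_dir + ".har/" + file_name


if __name__ == "__main__":
    print(har_uri_for("hdfs:///user/user1/data/db1/tab1/ds=2016_01_01/f1.txt"))
```

This only illustrates the naming convention; it does not by itself make Hive read from the archive.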

Would it be possible for Hive to query the HAR archives through the external table? If so, please suggest a way.

Best Regards


Solution

  • In practice, the line between "managed" and "external" tables is very thin.
    My suggestion:

    • create a "managed" table
    • explicitly add partitions for some days in the future, but with ad hoc locations -- i.e. the directories your external process expects to use
    • let the external process dump its files directly at HDFS level -- they are automagically exposed in Hive queries, "managed" or not
      (the Metastore does not track individual files and blocks, they are detected on each query; as a side note, you can run backup & restore operations at HDFS level if you wish, as long as you don't mess with the directory structure)
    • when a partition is "cold" and you are pretty sure there will never be another file dumped there, you can run a Hive command to archive the partition, i.e. move the small files into a single HAR + flag the partition as "archived" in the Metastore

    Bonus: it's easy to unarchive your partition within Hive (whereas there is no hadoop unarchive command AFAIK).

    Caveat: it's a "managed" table so remember not to DROP anything unless you have safely moved your data out of the Hive-managed directories.
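
The steps above could be scripted roughly as follows. This is only a sketch: the table name `tab1`, the partition column `ds`, and the location are taken from the question's example paths and are assumptions, as are the helper function names. The script only renders HiveQL text; it does not talk to Hive. Whether `ARCHIVE` accepts a partition with a custom location may depend on your Hive version, so try it on a test table first.

```python
def add_partition_sql(table, ds, location):
    """Render the statement that registers a partition at an ad hoc location."""
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (ds='{ds}') LOCATION '{location}';"
    )


def archive_sql(table, ds):
    """Render the statements that pack a cold partition into a HAR."""
    return [
        "SET hive.archive.enabled=true;",  # archiving is disabled by default
        f"ALTER TABLE {table} ARCHIVE PARTITION (ds='{ds}');",
    ]


def unarchive_sql(table, ds):
    """Render the statements that restore the partition's original files."""
    return [
        "SET hive.archive.enabled=true;",
        f"ALTER TABLE {table} UNARCHIVE PARTITION (ds='{ds}');",
    ]


if __name__ == "__main__":
    # Hypothetical values, mirroring the paths in the question.
    table, ds = "tab1", "2016_01_01"
    location = "/user/user1/data/db1/tab1/ds=2016_01_01"
    print(add_partition_sql(table, ds, location))
    print("\n".join(archive_sql(table, ds)))
```

The printed statements can be saved to a file and run with `hive -f`, or pasted into a Hive session once the partition is cold.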