
Oozie workflow: how to keep most recent 30 days in the table


I am trying to build a Hive table and automate it through Oozie. Data in the table should be no older than the last 30 days.

The action in the workflow will run every day: it will first purge data older than 30 days, then insert today's data. In other words, a sliding window with a 30-day interval.

Can someone show an example of how to achieve this?


Solution

  • Hive stores data in HDFS files, and these files are immutable.

    Well, actually, with recent Hadoop releases HDFS files can be appended to, or even truncated, but only through a low-level API, and Hive has no generic way to modify data files for the text/Avro/Parquet/ORC/whatever formats, so for practical purposes HDFS files are immutable as far as Hive is concerned.

    One workaround is to use transactional ORC tables that create/rewrite an entire data file on each "transaction" -- requiring a background process for periodic compaction of the resulting mess (e.g. another step of rewriting small files into bigger files).
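    As a rough sketch of that transactional-ORC workaround, assuming Hive 0.14+ with ACID support enabled (the table and column names below are made up for illustration):

    ```sql
    -- Hypothetical table; ACID requires ORC storage and 'transactional'='true'
    -- (older Hive releases also require a bucketed layout, as here).
    CREATE TABLE mydb.events_acid (event_ts STRING, payload STRING)
    CLUSTERED BY (event_ts) INTO 8 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional'='true');

    -- The daily purge then collapses to a single statement...
    DELETE FROM mydb.events_acid
    WHERE to_date(event_ts) < date_sub(current_date, 30);
    -- ...but the delete deltas still need the background compactor to clean up.
    ```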

    Another workaround would be an ad hoc batch rewrite of your table whenever you want to get rid of older data -- e.g. every 2 weeks, run a batch that removes data older than 30 days.

    > simple design

    • make sure that you will have no INSERT or SELECT running until the purge is over
    • create a new partitioned table with the same structure plus a dummy partitioning column
    • copy all the data-to-keep to that dummy partition
    • run an EXCHANGE PARTITION command
    • drop the partitioned table
    • now the older data is gone, and you can resume INSERTs
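    The steps above could be sketched in HiveQL roughly as follows; the table and column names are hypothetical, and note that EXCHANGE PARTITION requires the same partition keys on both tables, so the live table itself must already carry the dummy partition column:

    ```sql
    -- Hypothetical live table mydb.events(event_ts STRING, payload STRING),
    -- partitioned by a dummy column `slice`, with all rows in slice='live'.
    CREATE TABLE mydb.events_tmp LIKE mydb.events;

    -- copy only the data to keep into the scratch table's dummy partition
    INSERT OVERWRITE TABLE mydb.events_tmp PARTITION (slice='live')
    SELECT event_ts, payload
    FROM mydb.events
    WHERE to_date(event_ts) >= date_sub(current_date, 30);

    -- swap: drop the old partition, then move the freshly written one in
    ALTER TABLE mydb.events DROP PARTITION (slice='live');
    ALTER TABLE mydb.events EXCHANGE PARTITION (slice='live') WITH TABLE mydb.events_tmp;

    DROP TABLE mydb.events_tmp;
    ```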

    > alternative design, allows INSERTs while purge is running

    • rebuild your table with a dummy partitioning key, and make sure that all INSERTs always go into "current" partition
    • at purge time, rename "current" partition as "to_be_purged" (and make sure that you will run no SELECT until purge is over, otherwise you may get duplicates)
    • copy all the data-to-keep from "to_be_purged" to "current"
    • drop partition "to_be_purged"
    • now the older data is gone
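    The same alternative design, again as a hedged HiveQL sketch with hypothetical names (dummy partition column `slice`); the copy must append rather than overwrite so that INSERTs landing in the fresh "current" partition meanwhile are not lost:

    ```sql
    -- All INSERTs are assumed to target mydb.events PARTITION (slice='current').
    ALTER TABLE mydb.events PARTITION (slice='current')
      RENAME TO PARTITION (slice='to_be_purged');

    -- INSERTs can resume into a fresh 'current' partition while this runs,
    -- but SELECTs would see duplicates until the old partition is dropped
    INSERT INTO TABLE mydb.events PARTITION (slice='current')
    SELECT event_ts, payload
    FROM mydb.events
    WHERE slice = 'to_be_purged'
      AND to_date(event_ts) >= date_sub(current_date, 30);

    ALTER TABLE mydb.events DROP PARTITION (slice='to_be_purged');
    ```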

    But it would be so much simpler if your table were partitioned by month, in ISO format (i.e. YYYY-MM). In that case you could just get the list of partitions and drop every one whose key is "older" than (current month - 1), with a plain bash script. Believe me, it's simple and rock-solid.
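    That month-partitioned approach can be sketched as a bash script. The table name, the `month` partition column, and the partition values below are all hypothetical, and the Hive statements are only echoed here rather than executed:

    ```shell
    #!/usr/bin/env bash
    # Sketch, assuming a hypothetical table mydb.events partitioned by a string
    # column `month` in YYYY-MM format. In real use the partition list would
    # come from:  hive -e "SHOW PARTITIONS mydb.events"

    # Print a DROP PARTITION statement for every partition key older than the
    # cutoff. YYYY-MM sorts chronologically, so string comparison is enough.
    drop_old_partitions() {
        local cutoff="$1"; shift
        local p
        for p in "$@"; do
            if [[ "$p" < "$cutoff" ]]; then
                echo "ALTER TABLE mydb.events DROP PARTITION (month='$p');"
            fi
        done
    }

    # Cutoff = current month minus one; the -d syntax is GNU date.
    # Anchoring on the 1st avoids end-of-month drift (e.g. May 31 - 1 month).
    cutoff=$(date -d "$(date +%Y-%m-01) -1 month" +%Y-%m)
    drop_old_partitions "$cutoff" 2024-01 2024-02 2024-03
    ```

    Run daily as an Oozie shell action, this would pipe the `SHOW PARTITIONS` output into the function and feed each generated statement back through `hive -e`.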