I am trying to build a Hive table and automate it through oozie. Data in the table need not be older than last 30 days.
Action in the work flow would be run every day. It will first purge data that are 30 days older, and insert data for today. Sliding window with 30 days interval.
Can someone show an example how to achieve this?
Hive stores data in HDFS files, and these files are immutable.
Well, actually, with recent Hadoop releases HDFS files can be appended to, or even truncated, but with a low-level API, and Hive has no generic way to modify data files for text/AVRO/Parquet/ORC/whatever format, so for practical purposes HDFS files are immutable for Hive.
One workaround is to use transactional ORC tables that create/rewrite an entire data file on each "transaction" -- requiring a background process for periodic compaction of the resulting mess (e.g. another step of rewriting small files into bigger files).
Another workaround would be an ad hoc batch rewrite of your table whenever you want to get rid of older data -- e.g. every 2 weeks, run a batch that removes data older than 30 days.
> simple design
> alternative design, allows INSERTs while purge is running
But it would be soooooooooooooooooo much simpler if your table was partitioned by month, in ISO format (i.e. YYYY-MM). In that case you could just get the list of partitions and drop all that have a key "older" than (current month -1), with a plain bash script. Believe me, it's simple and rock-solid.