How do you efficiently design a Hive/Impala table considering the following facts?
If you are doing analytics on this data then a solid choice with Impala is using Parquet format. What has worked well for our users is to partition the date by year, month, day based a date value on the record.
So for example CREATE TABLE foo (tool_id int, eff_dt timestamp) partition (year int, month int, day int) stored as parquet
When loading the data into this table we use something like this to create dynamic partitions:
INSERT INTO foo partition (year, month, day)
SELECT tool_id, eff_dt, year(eff_dt), month(eff_dt), day(eff_dt)
FROM source_table;
Then you train your users that if they want the best performance to add YEAR, MONTH, DAY to their WHERE
clause so that it hits the partition for better performance. Then have them add the eff_dt
in the SELECT
statement so they have a date value in the format they like see in their final results.
In CDH, Parquet is storing by default data in 256MB chunks (which is configurable). Here is how to configure it: http://www.cloudera.com/documentation/enterprise/latest/topics/impala_parquet_file_size.html