Tags: hadoop, hive, impala, hadoop-partitioning

Hadoop partitioning. How do you efficiently design a Hive/Impala table?


How do you efficiently design a Hive/Impala table considering the following facts?

  1. The table receives about 100 million rows of tool data every day. The date on which the data is received is stored in a column in the table, along with the tool ID.
  2. Each tool has about 500 runs per day, each identified by a run ID column. Each run ID corresponds to roughly 1 MB of data.
  3. The default HDFS block size is 64 MB (a rough sizing based on these numbers follows the list).
  4. The table is searched by date, tool ID, and run ID, in that order.
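
For scale, facts 2 and 3 imply roughly:

    500 runs/day × ~1 MB per run ≈ 500 MB of data per tool per day
    500 MB ÷ 64 MB block size    ≈ 8 HDFS blocks per tool per day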

Solution

  • If you are doing analytics on this data, then a solid choice with Impala is the Parquet format. What has worked well for our users is to partition the table by year, month, and day, based on a date value on the record.

    So for example:

    CREATE TABLE foo (tool_id INT, eff_dt TIMESTAMP)
    PARTITIONED BY (year INT, month INT, day INT)
    STORED AS PARQUET;

    When loading the data into this table we use something like this to create dynamic partitions:

    INSERT INTO foo PARTITION (year, month, day)
    SELECT tool_id, eff_dt, YEAR(eff_dt), MONTH(eff_dt), DAY(eff_dt)
    FROM source_table;
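
    If the load runs through Hive rather than Impala (the answer does not say which engine performs the INSERT, so this is only a side note), dynamic partitioning usually has to be enabled for the session first, along these lines:

    -- Hive session settings for a dynamic-partition INSERT
    SET hive.exec.dynamic.partition=true;
    -- nonstrict mode allows all partition columns (year, month, day) to be dynamic
    SET hive.exec.dynamic.partition.mode=nonstrict;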
    

    Then train your users that, if they want the best performance, they should add YEAR, MONTH, and DAY to their WHERE clause so that the query hits the right partitions. Also have them include eff_dt in the SELECT list so they get a date value in the format they like to see in their final results; an example query along those lines is sketched below.
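
    For illustration, a query shaped that way might look like the following (the literal year/month/day and tool_id values are just placeholders):

    SELECT tool_id, eff_dt
    FROM foo
    WHERE year = 2016 AND month = 3 AND day = 15  -- partition columns, so only one day's partition is scanned
      AND tool_id = 42;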

    In CDH, Parquet stores data in 256 MB blocks by default (this is configurable). Here is how to configure it: http://www.cloudera.com/documentation/enterprise/latest/topics/impala_parquet_file_size.html
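
    As a sketch, assuming the INSERT is run from Impala, the Parquet block size can be overridden per session with the PARQUET_FILE_SIZE query option before writing (the 128m value below is only an example):

    -- write smaller Parquet files for this session
    SET PARQUET_FILE_SIZE=128m;
    INSERT INTO foo PARTITION (year, month, day)
    SELECT tool_id, eff_dt, YEAR(eff_dt), MONTH(eff_dt), DAY(eff_dt)
    FROM source_table;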