I want to learn about compaction in Impala tables but can't find any material to study. What are the different techniques, and where can I find material about them?
The main purpose of compaction is to avoid the small files problem, and the right approach depends on your use case. For example, you could have a process that writes small files into HDFS, and you want to query those files as an Impala table. You could have a staging table for these small files and load the base table using INSERT INTO TABLE base_table SELECT .....FROM stg_table to compact the tiny files into bigger files.
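As a rough sketch of the staging-table approach described above: the column names, the delimited text format, the Parquet target and the HDFS path are all assumptions made up for illustration; only the table names base_table and stg_table come from the statement above.

```sql
-- Hypothetical staging table sitting on top of the directory where the
-- writer process drops its many small files (path and columns are assumed).
CREATE EXTERNAL TABLE stg_table (
  id BIGINT,
  amount DOUBLE,
  event_time TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/staging/events';

-- Base table that queries should read from (Parquet is an assumption).
CREATE TABLE base_table (
  id BIGINT,
  amount DOUBLE,
  event_time TIMESTAMP
)
STORED AS PARQUET;

-- Make Impala aware of files added to the staging directory since the last run.
REFRESH stg_table;

-- Rewrite the staged rows into the base table. The rows from many small
-- input files are typically written out as far fewer, larger files.
INSERT INTO TABLE base_table
SELECT id, amount, event_time
FROM stg_table;
```

How the staged files are cleared after each load, so that rows are not inserted twice, is left out of this sketch and depends on your pipeline.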
Another use case would be with partitioning.
A major risk when using partitioning is creating partitions that lead you into the small files problem. When this happens, partitioning a table will actually worsen query performance (the opposite of the goal when using partitioning) because it causes too many small files to be created. This is more likely when using dynamic partitioning, but it could still happen with static partitioning, for example if you added a new partition to a sales table on a daily basis containing the sales from the previous day, and each day’s data is not particularly big.
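One way to check whether a table has already drifted into this situation is to look at its file counts and sizes. A minimal sketch, assuming a partitioned table named sales already exists (the name is only a placeholder taken from the example above):

```sql
-- One row per partition, including the number of files and the total size.
-- Many partitions holding only a few kilobytes or megabytes each is a
-- sign of the small files problem.
SHOW TABLE STATS sales;

-- List the individual data files with their sizes.
SHOW FILES IN sales;
```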
When choosing your partitions, you want to strike a happy balance between too many partitions (causing the small files problem) and too few partitions (providing little performance benefit). The partition column or columns should have a reasonable number of values for the partitions, but what you should consider reasonable is difficult to quantify.
Using dynamic partitioning is particularly dangerous because if you're not careful, it's easy to partition on a column with too many distinct values.
Imagine a use case where you are often looking for data that falls within a time frame that you would specify in your query. You might think that it's a good idea to partition on a column that pertains to time. But a TIMESTAMP column could have the time to the nanosecond, so every row could have a unique value; that would be a terrible choice for a partition column! Even to the minute or hour could create far too many partitions, depending on the nature of your data; partitioning by larger time units like day, month, or even year might be a better choice.
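To illustrate partitioning on coarser time units rather than on the raw TIMESTAMP, here is a hedged sketch. The tables events_raw and events_by_day and all their columns are hypothetical, and whether day-level partitions are coarse enough depends on how much data arrives per day.

```sql
-- Hypothetical unpartitioned source table holding the raw timestamp.
CREATE TABLE events_raw (
  id BIGINT,
  amount DOUBLE,
  event_time TIMESTAMP
)
STORED AS PARQUET;

-- Partition on year/month/day derived from the timestamp, not on the
-- timestamp itself, so the number of partitions stays bounded.
CREATE TABLE events_by_day (
  id BIGINT,
  amount DOUBLE,
  event_time TIMESTAMP
)
PARTITIONED BY (year INT, month INT, day INT)
STORED AS PARQUET;

-- Dynamic partition insert: the partition key values come from the last
-- three expressions of the SELECT list, in the order given in PARTITION.
INSERT INTO TABLE events_by_day PARTITION (year, month, day)
SELECT id, amount, event_time,
       year(event_time), month(event_time), day(event_time)
FROM events_raw;

-- A time-frame query can then filter on the partition columns so that
-- only the matching partitions are scanned.
SELECT sum(amount)
FROM events_by_day
WHERE year = 2023 AND month = 6;
```

If each day holds only a small amount of data, dropping the day column and partitioning by year and month alone would be the coarser variant of the same idea.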
The material above is only an introduction to the problem; there are many more use cases, and the general topic is performance and tuning.
You could get a start from the Cloudera documentation. You could follow this link:
Hope this helps.