Search code examples
databasehadoopdatabase-schemaclouderaimpala

load data into impala partitioned table


I have data in HDFS in following dir structure :

/exported/2014/07/01/00/SEARCHES/part-m-00000.bz2
                                 part-m-00001.bz2
/exported/2014/07/01/02/SEARCHES/part-m-00000.bz2
                                 part-m-00001.bz2
                                 part-m-00003.bz2
.
.
.
.
/exported/2014/08/01/09/SEARCHES/part-m-00005 .bz2

there are multiple part files in each subdirectory.

I want to load this dataset into impala table, so use following query to create table :

CREATE EXTERNAL TABLE search(time_stamp TIMESTAMP, ..... url STRING,domain STRING) PARTITIONED BY (year INT, month INT, day INT. hour INT) row format delimited fields terminated by '\t';

Then

ALTER TABLE search ADD PARTITION (year=2014, month=08, day=01) LOCATION '/data/jobs/exported/2014/08/01/*/SEARCHES/';

But it failed to load with following error :

ERROR: AnalysisException: Failed to load metadata for table: magneticbi.search_mmx
CAUSED BY: TableLoadingException: Failed to load metadata for table: search_mmx
CAUSED BY: RuntimeException: Compressed text files are not supported: part-m-00000.bz2

not sure what is the correct way to do this.

Anyone can help in this ?

Thanks


Solution

  • Here's a link to a table from Cloudera that describes your options. To summarize:

    Impala supports the following compression codecs:

    • Snappy. Recommended for its effective balance between compression ratio and decompression speed. Snappy compression is very fast, but GZIP provides greater space savings. Not supported for text files.
    • GZIP. Recommended when achieving the highest level of compression (and therefore greatest disk-space savings) is desired. Not supported for text files. Deflate. Not supported for text files.
    • BZIP2. Not supported for text files.
    • LZO, for Text files only. Impala can query LZO-compressed Text tables, but currently cannot create them or insert data into them; perform these operations in Hive.