I have an Impala partitioned table, stored as Parquet. Can I use Pig to load data from this table and add the partition keys as columns?
The Parquet table is defined as:
create table test.test_pig (
name string,
id bigint
)
partitioned by (gender string, age int)
stored as parquet;
And the Pig script is like:
A = LOAD '/test/test_pig' USING parquet.pig.ParquetLoader AS (name: bytearray, id: long);
However, gender and age are missing when I DUMP A. Only name and id are displayed.
I have tried with:
A = LOAD '/test/test_pig' USING parquet.pig.ParquetLoader AS (name: bytearray, id: long, gender: chararray, age: int);
But then I receive an error like:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1031: Incompatable schema: left is "name:bytearray,id:long,gender:bytearray,age:int", right is "name:bytearray,id:long"
Hope to get some advice here. Thank you!
You should try loading the table with org.apache.hcatalog.pig.HCatLoader instead of ParquetLoader.
Normally, Pig supports reading from and writing to partitioned tables through HCatalog:
read:
This load statement will load all partitions of the specified table.
/* myscript.pig */
A = LOAD 'tablename' USING org.apache.hcatalog.pig.HCatLoader();
...
...
If only some partitions of the specified table are needed, include a partition filter statement immediately following the load statement in the data flow. (In the script, however, a filter statement might not immediately follow its load statement.) The filter statement can include conditions on partition as well as non-partition columns.
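Applied to the table in the question, that could look roughly like the sketch below. It assumes the table is registered in the Hive metastore as test.test_pig and that Pig is started with the HCatalog jars available (for example with pig -useHCatalog). The filter values 'F' and 30 are made up for illustration. On newer Hive releases the loader class lives under org.apache.hive.hcatalog.pig.HCatLoader, so the package name may need adjusting.

```pig
-- Sketch only: assumes test.test_pig is visible through the Hive metastore
-- and that Pig was launched with HCatalog on the classpath (pig -useHCatalog).
-- No AS clause: HCatLoader takes the schema, including partition keys, from the metastore.
A = LOAD 'test.test_pig' USING org.apache.hcatalog.pig.HCatLoader();

-- gender and age are partition keys, but they can be used like ordinary fields.
-- The values 'F' and 30 are hypothetical.
B = FILTER A BY gender == 'F' AND age == 30;

DUMP B;
```

Keeping the FILTER right after the LOAD follows the advice quoted above about placing the partition filter immediately after the load statement in the data flow.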
write:
HCatOutputFormat will trigger on dynamic partitioning usage if necessary (if a key value is not specified) and will inspect the data to write it out appropriately.
https://cwiki.apache.org/confluence/display/Hive/HCatalog+DynamicPartitions
However, I don't think this has been properly tested with Parquet files yet (at least not by Cloudera):
Parquet has not been tested with HCatalog. Without HCatalog, Pig cannot correctly read dynamically partitioned tables; that is true for all file formats.
http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cdh_ig_parquet.html