Tags: apache-pig, hdfs, database-partitioning, parquet

Can Pig be used to LOAD from Parquet table in HDFS with partition, and add partitions as columns?


I have an Impala partitioned table, stored as Parquet. Can I use Pig to load data from this table and have the partition columns appear as regular columns?

The Parquet table is defined as:

create table test.test_pig (
    name string,
    id bigint
)
partitioned by (gender string, age int)
stored as parquet;

And the Pig script is like:

A = LOAD '/test/test_pig' USING parquet.pig.ParquetLoader AS (name: bytearray, id: long);

However, gender and age are missing when I DUMP A; only name and id are displayed.

I have tried with:

A = LOAD '/test/test_pig' USING parquet.pig.ParquetLoader AS (name: bytearray, id: long, gender: chararray, age: int);

But then I receive an error like:

ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1031: Incompatable schema: left is "name:bytearray,id:long,gender:bytearray,age:int", right is "name:bytearray,id:long"

Hope to get some advice here. Thank you!


Solution

  • Try the org.apache.hcatalog.pig.HCatLoader library.

    Through HCatalog, Pig supports reading from and writing to partitioned tables:

    read:

    This load statement will load all partitions of the specified table.

        /* myscript.pig */
        A = LOAD 'tablename' USING org.apache.hcatalog.pig.HCatLoader();
        ...

    If only some partitions of the specified table are needed, include a partition filter statement immediately following the load statement in the data flow. (In the script, however, a filter statement might not immediately follow its load statement.) The filter statement can include conditions on partition as well as non-partition columns.

    https://cwiki.apache.org/confluence/display/Hive/HCatalog+LoadStore#HCatalogLoadStore-RunningPigwithHCatalog
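    For this question's table, the read path might look like the following sketch. This assumes HCatalog is configured and Pig is launched with HCatalog on the classpath (e.g. pig -useHCatalog); the filter values are made up for illustration:

        /* Sketch: load the table through HCatalog instead of reading
           the Parquet files directly from HDFS. The partition columns
           (gender, age) then show up as ordinary fields. */
        A = LOAD 'test.test_pig' USING org.apache.hcatalog.pig.HCatLoader();

        -- Partition filter placed immediately after the LOAD so it can
        -- be pushed down and used to prune partitions:
        B = FILTER A BY gender == 'M' AND age > 30;
        DUMP B;

    Note that the LOAD target is the table name known to HCatalog, not the HDFS path used in the question.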

    write:

    HCatOutputFormat will trigger on dynamic partitioning usage if necessary (if a key value is not specified) and will inspect the data to write it out appropriately.

    https://cwiki.apache.org/confluence/display/Hive/HCatalog+DynamicPartitions
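    The write path could be sketched as follows. The target table name test.test_pig_out is an assumption for illustration; it would have to be an HCatalog-registered table with the same partition columns:

        /* Sketch: store into a partitioned table via HCatalog.
           Leaving the partition key values unspecified triggers dynamic
           partitioning, so gender and age are taken from the data. */
        A = LOAD 'test.test_pig' USING org.apache.hcatalog.pig.HCatLoader();
        STORE A INTO 'test.test_pig_out' USING org.apache.hcatalog.pig.HCatStorer();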

    However, I don't think this has yet been properly tested with Parquet files (at least not by the Cloudera folks):

    Parquet has not been tested with HCatalog. Without HCatalog, Pig cannot correctly read dynamically partitioned tables; that is true for all file formats.

    http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cdh_ig_parquet.html