apache-pig hcatalog

Can HCatalog in Apache Pig just load a specific partition?


I need to load the data for a certain partition (date) in Pig. This data was created in Hive and is partitioned on date, so I want to load it in Pig via HCatalog.

The HCatalog documentation says that to load a certain partition in Pig, you first load the whole dataset and then filter on it, i.e.:

a = load 'web_logs' using org.apache.hcatalog.pig.HCatLoader();
b = filter a by datestamp > '20110924';

Source: https://cwiki.apache.org/confluence/display/Hive/HCatalog+LoadStore

But I am afraid this first loads the whole dataset into a and only then filters it in b. Am I correct?

In Hive (without HCatalog) you can name just the partition you want, i.e.:

LOAD DATA INPATH 'filepath' INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]

What is the equivalent of this construct in Pig with HCatalog?

Thanks!


Solution

  • I see two parts to your question.

    Part 1) But I am afraid this first loads the whole dataset into a and only then filters it in b. Am I correct?

    Ans 1) No. When a filter on the partition column immediately follows the load statement, HCatalog is smart enough to load only the partitions that match the filter, so the whole table is not read first.
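
    As an illustration of Ans 1, here is a minimal sketch that reuses the web_logs table and datestamp partition column from the question. The status column is a hypothetical non-partition column added only for the example. The script assumes Pig is started with HCatalog on the classpath (for instance with pig -useHCatalog), and that your version still ships the org.apache.hcatalog.pig package; later Hive releases place the same classes under org.apache.hive.hcatalog.pig, so check which one applies to you.

    ```pig
    -- Load the table through HCatalog; nothing is read yet at this point.
    a = load 'web_logs' using org.apache.hcatalog.pig.HCatLoader();

    -- Partition filter placed directly after the load, so it can be
    -- applied when choosing which partitions to read.
    b = filter a by datestamp == '20110924';

    -- A range over the partition column works the same way:
    -- b = filter a by datestamp >= '20110924' and datestamp <= '20110925';

    -- Conditions on ordinary columns can follow afterwards
    -- ('status' is a hypothetical column, not from the question).
    c = filter b by status == 200;

    dump c;
    ```

    Keeping the partition condition in its own filter, right after the load, makes it easy to see at a glance which partitions the script depends on.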

    Part 2) LOAD DATA INPATH 'filepath' INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]

    What is the equivalent of this construct in Pig with HCatalog?

    Ans 2) Hive's LOAD DATA ... INTO TABLE writes data into a partition, so the Pig equivalent is a store statement: store a into 'tablename' using org.apache.hcatalog.pig.HCatStorer('partcol1=val1,partcol2=val2');

    e.g.: store a into 'tablename' using org.apache.hcatalog.pig.HCatStorer('datestamp=20110924');
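
    To put the store statement in context, here is a hedged end-to-end sketch. The input path, the tab delimiter, and the user and url fields are all hypothetical; the relation's schema has to match the non-partition columns of your actual Hive table. The commented-out variant follows my reading of the HCatalog Load/Store page, where calling HCatStorer with no argument writes to partitions based on a partition column present in the data; treat that as an assumption to verify against your version.

    ```pig
    -- Hypothetical raw input for one day; adjust path, delimiter and schema.
    raw = load '/data/incoming/20110924' using PigStorage('\t')
          as (user:chararray, url:chararray);

    -- Write everything in 'raw' into the single partition datestamp=20110924.
    store raw into 'web_logs'
          using org.apache.hcatalog.pig.HCatStorer('datestamp=20110924');

    -- Assumed alternative: if the relation itself carries a datestamp field,
    -- the partition spec can be left out so each row goes to its own partition.
    -- store with_date into 'web_logs' using org.apache.hcatalog.pig.HCatStorer();
    ```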

    Please drop a comment if you have any doubts.

    Thanks