How to write each group's key as a folder name and the bag contents as records?


Objective: write the unique key of each group as a folder name, and the contents of that group's bag as the records inside the folder.

 File: employee.txt

 #JoiningDate   Employee Id     Employee Name
   20140302        1             A
   20140302        2             B
   20140302        3             C
   20140303        4             D
   20140303        5             E
   20140303        6             F

Pig script :

  X = load 'employee.txt' using PigStorage('\t') as (joining_date:chararray, employee_id:long, employee_name:chararray);

  Y =  group X by joining_date;

The output of this (dump Y) would be:

(20140302, {(20140302,1,A), (20140302,2,B), (20140302,3,C)})
(20140303, {(20140303,4,D), (20140303,5,E), (20140303,6,F)})

The objective is to have two folders in the output path:

    1. outputfolder/20140302 : having three records
            20140302,1,A
            20140302,2,B
            20140302,3,C
    2. outputfolder/20140303 : having three records
            20140303,4,D
            20140303,5,E
            20140303,6,F

What I tried:

 store Y into 'outputfolder' using org.apache.pig.piggybank.storage.MultiStorage('outputfolder', '0', 'none', ',');

The result I see is:

     1. outputfolder/20140302/20140302-0
            (20140302, {(20140302,1,A), (20140302,2,B), (20140302,3,C)})
     2. outputfolder/20140303/20140303-0
            (20140303, {(20140303,4,D), (20140303,5,E), (20140303,6,F)})

Solution

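  • One option is to flatten the bag before the store command. MultiStorage writes each input tuple as one record, so storing the grouped relation Y puts the whole (group, bag) tuple on a single line. After FLATTEN, every bag element is a row of its own, and field 0 is still joining_date, so the same split index works.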

    X = load 'employee.txt' using PigStorage('\t') as (joining_date:chararray, employee_id:long, employee_name:chararray);
    Y = group X by joining_date;
    Z = FOREACH Y GENERATE FLATTEN($1);
    store Z into 'outputfolder' using org.apache.pig.piggybank.storage.MultiStorage('outputfolder', '0', 'none', ',');
    

    The output is stored in the outputfolder/20140302 folder (and likewise outputfolder/20140303), in files whose names start with the key, something like 20140302-0,000.
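
If the GROUP is there only to split the output, a possibly simpler variant is to skip GROUP and FLATTEN and store the loaded relation directly, since MultiStorage already routes each tuple by the field at the given index. A minimal sketch, assuming the same employee.txt and that piggybank.jar is available (the jar path below is a placeholder, not taken from the original post):

```pig
-- Placeholder path (assumption): point this at the piggybank.jar of your Pig installation.
REGISTER /path/to/piggybank.jar;

X = LOAD 'employee.txt' USING PigStorage('\t')
    AS (joining_date:chararray, employee_id:long, employee_name:chararray);

-- MultiStorage arguments: parent path, split field index, compression, field delimiter.
-- Each tuple is written to the subfolder named after field 0 (joining_date).
STORE X INTO 'outputfolder'
    USING org.apache.pig.piggybank.storage.MultiStorage('outputfolder', '0', 'none', ',');
```

This should give the same folder layout as the flattened version. Keep the GROUP step only if the script needs the grouped relation for something else.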