Tags: java, apache-pig, user-defined-functions

How can I "group by" using a column without displaying it?


I have an input file named "students.txt" in which each line has the following fields: id, first name, last name, date of birth. Here is its content:

111111 Harry Cover 28/01/1986
222222 John Doeuf 03/01/1996
333333 Jacques Selere 18/07/1998
444444 Jean Breille 06/08/1991

I'm trying to create a Pig script that prints all students grouped by month of birth. Right now, I have the following user-defined function (written in Java):

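package myudfs; // matches "DEFINE DATE myudfs.FormatDate" in the script

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.pig.EvalFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Splits a "dd/MM/yyyy" chararray into a one-tuple bag: (id, day, month, year).
public class FormatDate extends EvalFunc<DataBag> {
    TupleFactory mTupleFactory = TupleFactory.getInstance();
    BagFactory mBagFactory = BagFactory.getInstance();

    static int id = 0 ;
    public DataBag exec(Tuple input) throws IOException {
        try {
            DataBag output = mBagFactory.newDefaultBag();
            Object o = input.get(0);
            if (!(o instanceof String)) {
                // o may be null, so avoid calling getClass() on it directly
                throw new IOException("Expected input to be chararray, but got "
                        + (o == null ? "null" : o.getClass().getName()));
            }

            Tuple t = mTupleFactory.newTuple(4);
            StringTokenizer tok = new StringTokenizer((String)o, "/", false);

            int i = 0 ;
            t.set (0, id) ;
            // Fields 1..3 hold day, month, year; stop at 3 so the index stays in range.
            while (tok.hasMoreTokens() && i < 3) {
                i ++ ;
                t.set (i, tok.nextToken ()) ;
            }
            output.add(t);

            return output;
        } catch (ExecException ee) {
            // Rethrow instead of silently returning null
            throw new IOException("Could not split birthdate", ee);
        }
    }
}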
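
The splitting step can also be checked outside Pig. Below is a minimal sketch in plain Java, assuming every birthdate has the form dd/MM/yyyy. BirthdateParts is a hypothetical helper name used only for illustration; it is not part of the UDF above or of Pig.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Pig: mirrors the split done inside FormatDate.
final class BirthdateParts {
    private BirthdateParts() {}

    // Splits "dd/MM/yyyy" into [day, month, year]; rejects anything else.
    static List<String> split(String birthdate) {
        String[] parts = birthdate.split("/");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected dd/MM/yyyy but got: " + birthdate);
        }
        return Arrays.asList(parts);
    }

    public static void main(String[] args) {
        System.out.println(split("28/01/1986")); // prints [28, 01, 1986]
    }
}
```

The month is the element at index 1, which corresponds to the month field that the Pig script groups on.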

My current Pig script looks like this. I'm very new to this so it's probably bad.

REGISTER ./myudfs.jar ;
DEFINE DATE myudfs.FormatDate ;
R1 = LOAD 'students.txt' USING PigStorage('\t') 
     AS (stud_id : int, firstname : chararray, lastname : chararray, birthdate : chararray) ;
R2 = DISTINCT R1 ;
R3 = FOREACH R2 GENERATE stud_id, firstname, lastname, birthdate, FLATTEN(DATE(birthdate)) AS (id : int, day : chararray, month : chararray, year : chararray) ;
R4 = FOREACH R3 GENERATE stud_id, firstname, lastname, birthdate, month ;
R5 = GROUP R4 BY (month) ;
DUMP R5;

I can't figure out how to get rid of the "month" column without compromising the group by line. Thank you in advance.


Solution

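  • I am guessing that you don't want to 'see' the month field but still want the data grouped by month?

    After the GROUP, each record of R5 holds the key (named group) and a bag named R4 containing the matching rows. Continuing your script, use a nested FOREACH to choose which fields are kept in each bag; since only student is generated, the group key is dropped from the output as well: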

    R6 = FOREACH R5 {
        student = FOREACH R4 GENERATE stud_id, firstname, lastname, birthdate;
        GENERATE student;
    }
    
    DUMP R6;
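
    To make the effect concrete, here is a rough plain-Java model of what the GROUP plus nested FOREACH do: group rows by month, then emit only the bags, with no month column and no key. This is a sketch under assumptions, not Pig API. GroupByHiddenMonth, Student, monthOf and groupWithoutKey are hypothetical names, and the TreeMap is used only to give a stable order (Pig itself does not guarantee one).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical plain-Java model of the Pig relations; none of this is Pig API.
final class GroupByHiddenMonth {
    private GroupByHiddenMonth() {}

    // One row of R4 without the month column.
    static final class Student {
        final int studId;
        final String firstname;
        final String lastname;
        final String birthdate;

        Student(int studId, String firstname, String lastname, String birthdate) {
            this.studId = studId;
            this.firstname = firstname;
            this.lastname = lastname;
            this.birthdate = birthdate;
        }

        @Override
        public String toString() {
            return "(" + studId + "," + firstname + "," + lastname + "," + birthdate + ")";
        }
    }

    // The month is the middle part of "dd/MM/yyyy".
    static String monthOf(String birthdate) {
        return birthdate.split("/")[1];
    }

    // Groups by month, then returns only the bags (the map values),
    // which plays the role of "GENERATE student" without the group key.
    static List<List<Student>> groupWithoutKey(List<Student> students) {
        Map<String, List<Student>> byMonth = new TreeMap<>();
        for (Student s : students) {
            byMonth.computeIfAbsent(monthOf(s.birthdate), k -> new ArrayList<>()).add(s);
        }
        return new ArrayList<>(byMonth.values());
    }

    public static void main(String[] args) {
        List<Student> students = Arrays.asList(
            new Student(111111, "Harry", "Cover", "28/01/1986"),
            new Student(222222, "John", "Doeuf", "03/01/1996"),
            new Student(333333, "Jacques", "Selere", "18/07/1998"),
            new Student(444444, "Jean", "Breille", "06/08/1991"));
        // One line per month group; the month itself is never printed as a field.
        for (List<Student> bag : groupWithoutKey(students)) {
            System.out.println(bag);
        }
    }
}
```

    With the four sample students, this yields three bags: the two January students together, then one bag each for July and August.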