So I have an input file named "students.txt" which contains the following structure: id, first name, last name, date of birth.
Here is the content of it:
111111 Harry Cover 28/01/1986
222222 John Doeuf 03/01/1996
333333 Jacques Selere 18/07/1998
444444 Jean Breille 06/08/1991
I'm trying to create a Pig script that prints all students grouped by month of birth. As of right now, I have the following user defined function (written in Java):
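package myudfs; // matches the name used in DEFINE below

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.pig.EvalFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class FormatDate extends EvalFunc<DataBag> {
    TupleFactory mTupleFactory = TupleFactory.getInstance();
    BagFactory mBagFactory = BagFactory.getInstance();
    static int id = 0; // never incremented, so every output tuple gets id 0

    public DataBag exec(Tuple input) throws IOException {
        try {
            DataBag output = mBagFactory.newDefaultBag();
            Object o = input.get(0);
            if (!(o instanceof String)) {
                throw new IOException("Expected input to be chararray, but got "
                        + (o == null ? "null" : o.getClass().getName()));
            }
            // Output tuple layout: (id, day, month, year)
            Tuple t = mTupleFactory.newTuple(4);
            StringTokenizer tok = new StringTokenizer((String) o, "/", false);
            int i = 0;
            t.set(0, id);
            // Stop after three date parts so the index stays within the 4-field tuple
            while (tok.hasMoreTokens() && i < 3) {
                i++;
                t.set(i, tok.nextToken());
            }
            output.add(t);
            return output;
        } catch (ExecException ee) {
            // error handling goes here
        }
        return null;
    }
}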
My current Pig script looks like this. I'm very new to this so it's probably bad.
REGISTER ./myudfs.jar ;
DEFINE DATE myudfs.FormatDate ;
R1 = LOAD 'students.txt' USING PigStorage('\t')
AS (stud_id : int, firstname : chararray, lastname : chararray, birthdate : chararray) ;
R2 = DISTINCT R1 ;
R3 = FOREACH R2 GENERATE stud_id, firstname, lastname, birthdate, FLATTEN(DATE(birthdate)) AS (id : int, day : chararray, month : chararray, year : chararray) ;
R4 = FOREACH R3 GENERATE stud_id, firstname, lastname, birthdate, month ;
R5 = GROUP R4 BY (month) ;
DUMP R5;
I can't figure out how to get rid of the "month" column without compromising the group by line. Thank you in advance.
I am guessing that you don't want to 'see' the month field, but still have the data grouped by month?
Continuing your script, use a nested FOREACH to choose which fields are present in the bag groupings:
R6 = FOREACH R5 {
student = FOREACH R4 GENERATE stud_id, firstname, lastname, birthdate;
GENERATE student;
}
DUMP R6;
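As an aside, if the UDF is only there to get the month, a much smaller helper might be enough. Below is a minimal sketch in plain Java with no Pig dependencies. The names `MonthOfBirth` and `monthOf` are made up for illustration, and it assumes every birthdate is in dd/MM/yyyy form as in your sample data. The same logic could presumably be placed in the exec method of an EvalFunc that returns a single value instead of a bag, but I have not tested that against your setup.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical helper, not part of Pig: extracts the month from a dd/MM/yyyy string.
class MonthOfBirth {
    // Assumption: dates always look like 28/01/1986 (day/month/year).
    private static final DateTimeFormatter FORMAT = DateTimeFormatter.ofPattern("dd/MM/yyyy");

    // Returns the month as 1..12, or null when the input is missing or not a valid date.
    static Integer monthOf(String birthdate) {
        if (birthdate == null) {
            return null;
        }
        try {
            return LocalDate.parse(birthdate.trim(), FORMAT).getMonthValue();
        } catch (DateTimeParseException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(monthOf("28/01/1986")); // prints 1
        System.out.println(monthOf("18/07/1998")); // prints 7
    }
}
```

Returning null for bad input rather than throwing is a design choice here, so that one malformed line would not fail the whole job; whether that is what you want depends on your data.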