
Inserting tuples inside an inner bag using Pig Latin - Hadoop


I am trying to create the following format of relation using Pig Latin:

userid, day, {(pid,fulldate, x,y),(pid,fulldate, x,y), ...}

Relation description: Each user (userid) in each day (day) has purchased multiple products (pid)

I am loading the data and grouping it as follows:

A = LOAD '**from a HDFS URL**' AS (pid: chararray, userid: chararray, day: int, fulldate: chararray, x: chararray, y: chararray);
B = GROUP A BY (userid, day);
DESCRIBE B;

B: {group: (userid: chararray,day: int),A: {(pid: chararray,day: int,fulldate: chararray,x: chararray,userid: chararray,y: chararray)}}

C= FOREACH B FLATTEN(B) AS (userid,day), $1.pid, $1.fulldate,$1.x,$1.y;
Describe C;

C: {userid: chararray,day: int,{(pid: chararray)},{(fulldate: chararray)},{(x: chararray)},{(y: chararray)}}

The result of DESCRIBE C is not the format I want. What am I doing wrong?


Solution

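  • Your script is correct up to the GROUP BY. The problem is in C: judging by the DESCRIBE output, each of $1.pid, $1.fulldate, $1.x and $1.y is projected into its own single-field bag, so you end up with four separate bags instead of one bag of four-field tuples. To arrive at the format you are looking for, project all four fields from A in a single expression. A nested FOREACH is a convenient place to do that:

    C = FOREACH B {
             -- project the four fields together so they stay in the same tuple
             data = A.(pid, fulldate, x, y);
             GENERATE FLATTEN(group) AS (userid, day), data;
        }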
    

    This gives C one record for each (userid, day), with all the corresponding (pid, fulldate, x, y) tuples collected in a single bag. You can read more about nested FOREACH in chapter 6 of Programming Pig: https://www.safaribooksonline.com/library/view/programming-pig/9781449317881/ch06.html (search for "nested foreach" on that page).
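
    For reference, here is a minimal end-to-end sketch of the same idea, not a tested script. It assumes the input is readable by the default loader, uses a hypothetical path '/path/to/purchases', and names the bag purchases purely for illustration. It also does the projection directly in GENERATE, which should be equivalent to the nested form for this case.

    ```pig
    -- Hypothetical input path; replace it with the real HDFS location.
    A = LOAD '/path/to/purchases'
        AS (pid: chararray, userid: chararray, day: int,
            fulldate: chararray, x: chararray, y: chararray);

    -- One record per (userid, day); the bag A holds that group's full rows.
    B = GROUP A BY (userid, day);

    -- A.(pid, fulldate, x, y) keeps the four fields in one tuple per purchase.
    -- "purchases" is an illustrative alias, not a name from the question.
    C = FOREACH B GENERATE
            FLATTEN(group) AS (userid, day),
            A.(pid, fulldate, x, y) AS purchases;

    DESCRIBE C;
    ```

    If this behaves as expected, DESCRIBE C should show userid, day and a single bag of (pid, fulldate, x, y) tuples, rather than four separate bags.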