I am trying to load a bag data type into a Pig relation, but the bag field comes back as null.
Sample input:
A000,B000,C000,1.0,1-1-14,3-31-14,{(A101,1-Jan-2014,0.03,0.04)}
A001,B001,C001,10.0,1-1-14,3-31-14,{(A101,1-Jan-2014,0.03,0.045)}
A002,B002,C002,100.0,1-1-14,3-31-14,{(A101,1-Jan-2014,0.03,0.04)}
Pig Script:
raw = LOAD 'input/meh.log' USING PigStorage(',') AS (PID, FUNDID, GICID, balance, startDate, endDate, rates:bag{t:tuple(t1,t2,t3,t4)});
DUMP raw;
Output:
(A000,B000,C000,1.0,1-1-14,3-31-14,)
(A001,B001,C001,10.0,1-1-14,3-31-14,)
(A002,B002,C002,100.0,1-1-14,3-31-14,)
^Bag values should be here
What am I doing wrong? I've tried removing the bag/tuple declarations from the LOAD function, and still nothing. I used this same approach when working on the bag tutorial that came with Pig, and that seemed to work just fine.
UPDATE: If I set the bag input so that each tuple has one value, then this script works. I'm starting to think this may be an issue with my version of Pig (0.12.2). I had to build Pig using Ant so that it can run on Hadoop 2.3. Thoughts?
I reformatted the data so that the top-level fields are separated by tabs instead of commas:
A000 B000 C000 1 1-1-14 3-31-14 {(101,1-Jan-2014,0.03,0.04)}
A001 B001 C001 10 1-1-14 3-31-14 {(101,1-Jan-2014,0.03,0.04)}
A002 B002 C002 100 1-1-14 3-31-14 {(101,1-Jan-2014,0.03,0.04)}
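With the top-level fields separated by tabs, the load works. I had the delimiter set to ',', and as far as I can tell PigStorage splits each line on the delimiter before it parses any field, so the commas inside the bag cut it into pieces that no longer parse as a bag, and the field comes back null. That would also explain why a bag whose tuples hold a single value loaded fine: it has no commas inside it. If your bags contain multi-field tuples, either set the delimiter to anything but ',' or don't set it at all and let PigStorage use its default, which is tab.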
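For reference, here is a minimal sketch of the working load, assuming the tab-separated data above is saved to the same input/meh.log path. The field types (chararray, double) and the relation name flat are my own additions for illustration; the original script left every field untyped.

```pig
-- Assumes the tab-separated sample above is stored at input/meh.log.
-- PigStorage() with no argument splits on tabs, so the commas inside the bag are left alone.
raw = LOAD 'input/meh.log' USING PigStorage()
      AS (PID:chararray, FUNDID:chararray, GICID:chararray, balance:double,
          startDate:chararray, endDate:chararray,
          rates:bag{t:tuple(t1:chararray, t2:chararray, t3:double, t4:double)});

-- FLATTEN un-nests the bag, producing one output row per tuple in rates.
-- (flat is a hypothetical relation name, not part of the original script.)
flat = FOREACH raw GENERATE PID, FLATTEN(rates) AS (t1, t2, t3, t4);
DUMP flat;
```

If the input has to stay comma-separated at the top level, the bag would need to be written with a different inner separator or parsed by a custom loader, which I have not tried.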