Input data set
data.csv
----------
col1,col2,col3
68,emp101,a1
74,emp101,null
56,emp101,a1
67,emp101,a2
45,emp102,b1
78,emp102,b2
23,emp102,b3
For the col2, I need to find distinct values for col3 excluding null.
emp101 has 2 distinct values -----> a1,a2 emp102 has 3 distinct values -----> b1,b2,b3
emp101 has 4 records and 2 distinct values, the 4 records must be replicated 2 times with addition of new col4, which will be distinct value of col3 for each copy.
emp102 has 3 records and 3 distinct values, the 3 records must be replicated 3 times with addition of new col4, which will be distinct value of col3 for each copy.
Expected Output
col1,col2,col3,col4
68,emp101,a1,a1
74,emp101,null,a1
56,emp101,a1,a1
67,emp101,a2,a1
68,emp101,a1,a2
74,emp101,null,a2
56,emp101,a1,a2
67,emp101,a2,a2
45,emp102,b1,b1
78,emp102,b2,b1
23,emp102,b3,b1
45,emp102,b1,b2
78,emp102,b2,b2
23,emp102,b3,b2
45,emp102,b1,b3
78,emp102,b2,b3
23,emp102,b3,b3
grunt>input1= load 'data.csv' using PigStorage(',') as (age: int, eid: chararray, grade: chararray);
grunt>input2= GROUP input1 by eid;
grunt> input3= distinct input1 by eid,grade;
2017-05-26 08:35:59,056 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 31, column 24> mismatched input 'by' expecting SEMI_COLON
Let's call the original relation, base.
-- Create a relation with col2 and col3
-- Filter out where col3 is null
-- Take distinct tuples (equivalent to records)
col23_rel = DISTINCT (FOREACH (FILTER base BY col3 is not null) generate col2, col3);
dump col23_rel;
(emp101,a1)
(emp101,a2)
(emp102,b1)
(emp102,b2)
(emp102,b3)
-- Now, join col23_rel back to base on col2. This will generate desired output.
jnd = JOIN base by col2, col23_rel by col2;
dump jnd;
(68,emp101,a1,emp101,a1)
(68,emp101,a1,emp101,a2)
(74,emp101,,emp101,a1)
(74,emp101,,emp101,a2)
(56,emp101,a1,emp101,a1)
(56,emp101,a1,emp101,a2)
(67,emp101,a2,emp101,a1)
(67,emp101,a2,emp101,a2)
(45,emp102,b1,emp102,b1)
(45,emp102,b1,emp102,b2)
(45,emp102,b1,emp102,b3)
(78,emp102,b2,emp102,b1)
(78,emp102,b2,emp102,b2)
(78,emp102,b2,emp102,b3)
(23,emp102,b3,emp102,b1)
(23,emp102,b3,emp102,b2)
(23,emp102,b3,emp102,b3)