I am trying to convert my input data, which looks like this:
Id,Name,Types,Code
1, A, a1;a2;a3, 101
2, B, b1;b2, 202
...
into a flattened structure where the types are split into individual rows, like this:
1, A, a1, 101
1, A, a2, 101
1, A, a3, 101
2, B, b1, 202
2, B, b2, 202
...
Here is what I have tried: STRSPLIT gives me a tuple, which I then try to convert to a bag so that I can FLATTEN it into individual rows.
input_data = LOAD '/user/gjhawar/latestSkillMappedEn.csv' USING PigStorage('|') AS
(
id : chararray,
name : chararray,
type: chararray,
code : chararray);
a = LIMIT input_data 10;
b = FOREACH a GENERATE (id, name, code), BagToString(TOBAG(STRSPLIT (type,'\\u003B',100)), ' ') as newCategoryName:chararray;
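The semicolon is problematic as a delimiter. Replace it with another character, TOKENIZE on that character, and FLATTEN the resulting bag. Note that TOBAG(STRSPLIT(...)) does not give you one element per type: TOBAG places the whole tuple returned by STRSPLIT into the bag as a single element, so there is nothing to spread across rows.
The word count example here uses the same tokenize-and-flatten pattern:
http://www.hadooplessons.info/2015/01/word-count-in-pig-latin.html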
flattened_input_data = FOREACH a GENERATE skillId, skillName, matchType, culture, FLATTEN(TOKENIZE(REPLACE(categoryName,'\\u003B', '|'), '|')) as newCategoryName;
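The statement above uses field names (skillId, skillName, matchType, culture, categoryName) that do not appear in the schema shown in the question. Below is a minimal sketch of the same approach adapted to that schema, assuming the file really is pipe-delimited as in the LOAD statement and that the type field never contains a '|' itself. The aliases flattened and type_item are hypothetical names chosen for illustration.

```pig
-- Load with the schema from the question (pipe-delimited, all fields chararray).
input_data = LOAD '/user/gjhawar/latestSkillMappedEn.csv' USING PigStorage('|') AS
    (id:chararray, name:chararray, type:chararray, code:chararray);

-- Replace each semicolon (written as \\u003B) with '|',
-- split the string on '|' with TOKENIZE, which returns a bag of tokens,
-- then FLATTEN the bag so each token lands in its own row.
flattened = FOREACH input_data GENERATE
    id,
    name,
    FLATTEN(TOKENIZE(REPLACE(type, '\\u003B', '|'), '|')) AS type_item,
    code;

DUMP flattened;
```

Because FLATTEN is applied to a bag, the other fields in the GENERATE clause (id, name, code) are repeated for every token, which is what produces one row per type.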