I've been working on a project in which I need to output the final data in a specific format. My actual dataset is quite complex, so I will explain my problem using dummy data.
If I have the following data:
1
2
3
4
5
5
4
2
1
Then I want to output this data using Pig in the following format:
Between 4 and 8 2
Between 1 and 5 5
Note: "between 4 and 8" is exclusive, so 4 and 8 themselves are not counted.
I have tried the following code, but how can I add the text "Between 4 and 8" to the final output in Pig?
data = LOAD 'f.txt' AS num:int;
data1 = GROUP data BY num;
data2 = FOREACH data1 GENERATE group AS num, COUNT(data) AS count;
data3 = FILTER data2 BY count > 4 AND count < 8;
data4 = FILTER data3 BY count > 1 AND count < 5;
From here onwards I have no idea how data3 and data4 can be stored in a single file in the format I have specified above.
Create two filtered datasets from the raw values (filter on num, not on the per-value count), count the rows in each with GROUP ... ALL, and union the two results into a single output. Before writing, put the literal text you want in front of each count.
data = LOAD 'f.txt' AS (num:int);
data3 = FILTER data BY num > 4 AND num < 8;
data4 = FILTER data BY num > 1 AND num < 5;
data3_grp = GROUP data3 ALL;
data3_count = FOREACH data3_grp GENERATE 'Between 4 and 8',COUNT(data3);
data4_grp = GROUP data4 ALL;
data4_count = FOREACH data4_grp GENERATE 'Between 1 and 5',COUNT(data4);
data5 = UNION data3_count,data4_count;
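The script above builds data5 but never writes it out, so a STORE statement is still needed. Here is a minimal sketch of the whole script. It assumes you want a space between the label and the count, and 'range_counts' is a hypothetical output path, as are the field names label and cnt.

```pig
-- Load one integer per line.
data = LOAD 'f.txt' AS (num:int);

-- Keep only the values strictly inside each range (bounds excluded).
data3 = FILTER data BY num > 4 AND num < 8;
data4 = FILTER data BY num > 1 AND num < 5;

-- Collapse each filtered relation to one row: literal label plus row count.
data3_grp = GROUP data3 ALL;
data3_count = FOREACH data3_grp GENERATE 'Between 4 and 8' AS label, COUNT(data3) AS cnt;
data4_grp = GROUP data4 ALL;
data4_count = FOREACH data4_grp GENERATE 'Between 1 and 5' AS label, COUNT(data4) AS cnt;

-- Combine both rows and write them out, space-delimited.
-- 'range_counts' is a hypothetical output path.
data5 = UNION data3_count, data4_count;
STORE data5 INTO 'range_counts' USING PigStorage(' ');
```

With the sample data this should give the rows "Between 4 and 8 2" and "Between 1 and 5 5", in no guaranteed order. Note that STORE writes part files into a directory rather than one file, so you may need to merge them afterwards (for example with hadoop fs -getmerge). Also, if a filter matches no rows, GROUP ... ALL produces no row at all, so that label will be missing from the output instead of showing 0.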