Storing data in a file with a specific format using Pig


I've been working on a project in which I need to output the final data in a specific format. Although my actual dataset is quite complex, I will explain my problem using dummy data.

If I have the following data:

1
2
3
4
5
5
4
2
1

Then I want to output this data using Pig in the following format:

Between 4 and 8 2
Between 1 and 5 5

Note: "Between 4 and 8" excludes the endpoints 4 and 8 themselves.

I have tried the following code, but how can I add the text "Between 4 and 8" to the final output in Pig?

data = LOAD 'f.txt' AS num:int;
data1 = GROUP data BY num;
data2 = FOREACH data1 GENERATE group AS num, COUNT(data) AS count;
data3 = FILTER data2 BY count > 4 AND count < 8;
data4 = FILTER data3 BY count > 1 AND count < 5;

From here on, I have no idea how data3 and data4 can be stored in a single file in the format I specified above.


Solution

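  • Create two filtered datasets, count the rows in each, and union the results into a single output. Before writing, add the literal text that you want in front of each count. Note that the filters apply to num directly: grouping by num and filtering on the per-value count, as in the question, tests how often each value occurs rather than the value itself.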

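    data = LOAD 'f.txt' AS (num:int);

    -- Filter on the values themselves; both bounds are exclusive.
    data3 = FILTER data BY num > 4 AND num < 8;
    data4 = FILTER data BY num > 1 AND num < 5;

    -- GROUP ... ALL puts every row into one group, so COUNT yields a single total.
    data3_grp = GROUP data3 ALL;
    data3_count = FOREACH data3_grp GENERATE 'Between 4 and 8', COUNT(data3);

    data4_grp = GROUP data4 ALL;
    data4_count = FOREACH data4_grp GENERATE 'Between 1 and 5', COUNT(data4);

    -- Combine both labelled counts into one relation (row order is not guaranteed).
    data5 = UNION data3_count, data4_count;

    -- Write the result space-delimited; 'output' is a placeholder path.
    STORE data5 INTO 'output' USING PigStorage(' ');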
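
As a variation on the approach above, the two counts could also be computed in a single pass by flagging each row per range and summing the flags. This is a different technique from the filter-and-union answer, not a restatement of it, and it is only a sketch: the alias names (flags, totals, result) and the output path 'output_single_pass' are assumptions made for illustration, and the script has not been run against the asker's real dataset.

```pig
-- Single-pass sketch: flag each row per range, then sum the flags.
data = LOAD 'f.txt' AS (num:int);

-- 1 if the value lies strictly inside the range, otherwise 0.
flags = FOREACH data GENERATE
    ((num > 4 AND num < 8) ? 1 : 0) AS in_4_8,
    ((num > 1 AND num < 5) ? 1 : 0) AS in_1_5;

-- One group holding every row, so SUM gives overall totals.
grp = GROUP flags ALL;
totals = FOREACH grp GENERATE
    SUM(flags.in_4_8) AS n_4_8,
    SUM(flags.in_1_5) AS n_1_5;

-- Turn the single row of totals into one labelled row per range.
result = FOREACH totals GENERATE
    FLATTEN(TOBAG(
        TOTUPLE('Between 4 and 8', n_4_8),
        TOTUPLE('Between 1 and 5', n_1_5)));

-- 'output_single_pass' is a placeholder path; the delimiter is a single space.
STORE result INTO 'output_single_pass' USING PigStorage(' ');
```

One design difference worth noting: as long as the input is not empty, this version should emit a row with a count of 0 for a range that matches nothing, whereas grouping an empty filtered relation produces no row at all. In either version, STORE writes a directory of part files rather than one file, so a single output file typically requires a merge step afterwards, for example with hadoop fs -getmerge.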