apache-pig, cloudera, scalar

Pig scalar is bigger than 0


I have the following code:

Data1 = LOAD '/user/cloudera/Class Ex 2/Data 1' USING PigStorage(',') as (Name:chararray,ID:chararray,text_1:chararray,Grade_1:int,Grade_2:int,Grade_3:int,Grade_4:int);
Data2 = LOAD '/user/cloudera/Class Ex 2/Data 2' USING PigStorage(',') as (Name:chararray,ID:chararray,text_2:chararray,Grade_5:int,Grade_6:int,Grade_7:int,Grade_8:int);

Data_3 = JOIN Data1 BY Data1.ID,Data2 BY Data2.ID;
Data_4 = FOREACH Data_3 GENERATE $0,$1,$2,$3,$4,$5,$6,$9,$10,$11,$12,$13;

Data_5 = FOREACH Data_4 GENERATE
                            Name,
                            ID,
                            text_1,
                            SIZE(text_1),
                            REPLACE(text_1,'or',''),
                            SIZE(REPLACE(text_1,'or','')),
                            SIZE(text_1)-SIZE(REPLACE(text_1,'or','')),
                            text_2,
                            SIZE(text_2),
                            REPLACE(text_2,'or',''),
                            SIZE(REPLACE(text_2,'or','')),
                            SIZE(text_2)-SIZE(REPLACE(text_2,'or','')),
                            ($3+$4+$5+$6+$8+$9+$10+$11)/8;
DESCRIBE Data_5;
STORE Data_5 Into '/user/cloudera/Class Ex 2/Data_output' USING PigStorage(',');

Essentially, I have to load two data sets and then compute some basic text statistics and do some text manipulation. Everything works fine until the last statement, STORE. When I add it, I receive the scalar error.

What am I doing wrong here? Thanks guys!


Solution

  • First of all, Pig only evaluates the aliases that eventually lead to a STORE or a DUMP (this is called lazy evaluation). Hence, your error was always there; it only surfaced once you added the STORE statement. Since you have not pasted the full trace, my guess is that the error is in the third statement, where you try to access the field ID using the dot (.) operator. You need to change it to one of the following:

    1) Refer to the field ID directly, since there is only one field called ID in each of Data1 and Data2:

    Data_3 = JOIN Data1 BY ID, Data2 BY ID;
    

    2) Use :: instead of . if you do need to disambiguate:

    Data_3 = JOIN Data1 BY Data1::ID, Data2 BY Data2::ID;
    

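    As for why the dot (.) operator caused an error: Pig reads Data1.ID as a scalar projection, which is only valid when Data1 holds exactly one row. For more detail, it might help to look at the following question: Getting exception while trying to execute a Pig Latin Script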
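To make the scalar behaviour concrete, here is a minimal sketch of a case where the dot operator on a relation is legitimate. The input path, the relation names (grades, all_rows, stats, with_diff) and the schema are hypothetical and chosen only for illustration; they are not part of the original script. The dot projection works here only because stats is reduced to a single row first.

```pig
-- Hypothetical input path and schema, used only for illustration.
grades = LOAD 'grades.csv' USING PigStorage(',')
         AS (name:chararray, id:chararray, grade:int);

-- GROUP ... ALL puts every row into one group, so the FOREACH below
-- produces a relation with exactly one row.
all_rows = GROUP grades ALL;

-- Inside the grouped relation, grades.grade is a bag projection
-- (all grade values in the group), not a scalar.
stats = FOREACH all_rows GENERATE AVG(grades.grade) AS avg_grade;

-- Here stats.avg_grade is a scalar projection. It is valid because stats
-- has a single row. Using a multi-row relation this way (as Data1.ID does
-- in the JOIN above) is what triggers the scalar error at run time.
with_diff = FOREACH grades GENERATE
                name,
                id,
                grade - (double)stats.avg_grade AS diff_from_avg;

DUMP with_diff;
```

Because of lazy evaluation, nothing in this sketch runs until the DUMP at the end, which is the same reason the original error only appeared once STORE was added.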