I have the following code:
Data1 = LOAD '/user/cloudera/Class Ex 2/Data 1' USING PigStorage(',') as (Name:chararray,ID:chararray,text_1:chararray,Grade_1:int,Grade_2:int,Grade_3:int,Grade_4:int);
Data2 = LOAD '/user/cloudera/Class Ex 2/Data 2' USING PigStorage(',') as (Name:chararray,ID:chararray,text_2:chararray,Grade_5:int,Grade_6:int,Grade_7:int,Grade_8:int);
Data_3 = JOIN Data1 BY Data1.ID,Data2 BY Data2.ID;
Data_4 = FOREACH Data_3 GENERATE $0,$1,$2,$3,$4,$5,$6,$9,$10,$11,$12,$13;
Data_5 = FOREACH Data_4 GENERATE
Name,
ID,
text_1,
SIZE(text_1),
REPLACE(text_1,'or',''),
SIZE(REPLACE(text_1,'or','')),
SIZE(text_1)-SIZE(REPLACE(text_1,'or','')),
text_2,
SIZE(text_2),
REPLACE(text_2,'or',''),
SIZE(REPLACE(text_2,'or','')),
SIZE(text_2)-SIZE(REPLACE(text_2,'or','')),
($3+$4+$5+$6+$8+$9+$10+$11)/8;
DESCRIBE Data_5;
STORE Data_5 Into '/user/cloudera/Class Ex 2/Data_output' USING PigStorage(',');
Essentially, I have to load two sets of data and then compute some basic text statistics and do some text manipulation. Everything works fine until I add the last statement, STORE; once I add it, I receive the scalar error.
What am I doing wrong here? Thanks guys!
First of all, Pig only evaluates the aliases that eventually lead to a STORE or a DUMP (this is called lazy evaluation). Hence, your error was always there; it just got caught once you added the STORE statement. Since you have not pasted the full trace, my guess is that the error is in the third statement, where you try to access the field ID using the dot (.) operator. You need to change it to one of the following:
1) Refer to the field ID directly, since there is only one field called ID in each of Data1 and Data2:
Data_3 = JOIN Data1 BY ID, Data2 BY ID;
2) Use :: instead of . if you do need to disambiguate:
Data_3 = JOIN Data1 BY Data1::ID, Data2 BY Data2::ID;
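For context, :: is also how Pig names fields that collide after a join. A minimal sketch, assuming Data1 and Data2 are loaded exactly as in the question (the AS names are only illustrative):

```pig
-- Join on the ID field of each input.
Data_3 = JOIN Data1 BY ID, Data2 BY ID;

-- The joined schema prefixes each field with its source alias,
-- e.g. Data1::Name, Data1::ID, Data2::Name, Data2::ID.
DESCRIBE Data_3;

-- Name and ID exist in both inputs, so they need the :: prefix here.
-- text_1 and text_2 are unique and can be referenced without it.
Data_4 = FOREACH Data_3 GENERATE
    Data1::Name AS Name,
    Data1::ID   AS ID,
    text_1,
    text_2;
```

Referencing fields by name this way can be easier to follow than positional references such as $0 or $9.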
If you want to know why the dot (.) operator caused an error, it might help to look at the following question: Getting exception while trying to execute a Pig Latin Script
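In short, when a relation name is followed by a dot and a field name at this level of a script, Pig treats it as a scalar projection, which is only valid when the relation contains exactly one row. That is most likely where the scalar error comes from. A minimal sketch of a legitimate use, with a hypothetical Grades relation and path (neither is from the question):

```pig
-- Hypothetical input, for illustration only.
Grades = LOAD '/path/to/grades' USING PigStorage(',') AS (ID:chararray, Grade:int);

-- GROUP ALL puts every record into a single group, so Stats has exactly one row.
All_Grades = GROUP Grades ALL;
-- Inside this FOREACH, Grades.Grade projects a column out of the grouped bag.
Stats = FOREACH All_Grades GENERATE AVG(Grades.Grade) AS avg_grade;

-- Stats.avg_grade is a scalar projection: fine, because Stats has one row.
Diff = FOREACH Grades GENERATE ID, (double)Grade - Stats.avg_grade;
```

Data1 in the question has many rows, so Data1.ID cannot be used as a scalar in the JOIN.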