I have the following data:
John,fl,3
John,wt,3
John,sp,4
John,sm,3
Mary,fl,3
Mary,wt,3
Mary,sp,4
Mary,sm,4
I want to calculate the average GPA (third column) per name (first column). For this I wrote the following Pig script, and it works fine:
a = LOAD '/root/sample.txt' using PigStorage(',') as (name:chararray, other:chararray, gpa:int);
b = group a by name;
c = foreach b generate group, AVG(a.gpa);
I then rewrote the script so that the columns are cast in a separate step rather than in the LOAD statement. This version fails with the ClassCastException below (the rewritten script follows the stack trace):
java.lang.Exception: java.lang.ClassCastException: org.apache.pig.data.DataByteArray cannot be cast to java.lang.String
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
a = LOAD '/root/sample.txt' using PigStorage(',');
b = foreach a generate $0 as name:chararray, $1 as other:chararray, $2 as gpa:int;
c = group b by name;
d = foreach c generate group, AVG(b.gpa);
I don't understand why. How are the two scripts different?
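This looks like a bug that may be fixed as of version 0.17. When LOAD is given no schema, PigStorage returns every field as a bytearray, and in the affected versions the AS clause in FOREACH ... GENERATE appears only to relabel the schema without converting the underlying values. The values are therefore still DataByteArray objects when the group key is later treated as a String, which matches the exception you see.
See "Run a String through Java using Pig" and https://issues.apache.org/jira/browse/PIG-2315 for details.
For now, the second approach needs an explicit type cast to work: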
b = foreach a generate (chararray)$0 as name:chararray, (chararray)$1 as other:chararray, (int)$2 as gpa:int;
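For completeness, here is a sketch of the whole second script with that line swapped in. It assumes the same input path and column layout as in the question; the DESCRIBE and DUMP statements are only added here for checking the result and are not part of the original script.

```pig
-- Load without a schema: every field arrives as a bytearray.
a = LOAD '/root/sample.txt' USING PigStorage(',');

-- Cast each field explicitly so the values are actually converted,
-- not just relabelled by the AS clause.
b = FOREACH a GENERATE (chararray)$0 AS name:chararray,
                       (chararray)$1 AS other:chararray,
                       (int)$2 AS gpa:int;

-- Optional check (added for illustration): print the schema of b.
DESCRIBE b;

c = GROUP b BY name;
d = FOREACH c GENERATE group, AVG(b.gpa);

-- Optional check (added for illustration): print the per-name averages.
DUMP d;
```

With the explicit casts in place, the relation b should carry the same types as the relation a in your first script, so the GROUP and AVG steps behave the same way in both.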