
Casting in LOAD vs. in a separate step


I have the following data:

John,fl,3
John,wt,3
John,sp,4
John,sm,3
Mary,fl,3
Mary,wt,3
Mary,sp,4
Mary,sm,4

I want to calculate the average GPA (third column) per name (first column). I wrote the following Pig script for this, and it works fine:

a = LOAD '/root/sample.txt' using PigStorage(',') as (name:chararray, other:chararray, gpa:int);
b = group a by name;
c = foreach b generate group, AVG(a.gpa);

I then rewrote the script to cast the columns in a separate step instead of in the LOAD statement. That version (the second listing below) fails with a ClassCastException:

java.lang.Exception: java.lang.ClassCastException: org.apache.pig.data.DataByteArray cannot be cast to java.lang.String
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)

a = LOAD '/root/sample.txt' using PigStorage(',');
b = foreach a generate $0 as name:chararray, $1 as other:chararray, $2 as gpa:int;
c = group b by name;
d = foreach c generate group, AVG(b.gpa);

I don't understand why this happens. How do the two scripts differ?


Solution

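  • This looks like a bug that may be fixed in version 0.17. When LOAD has no schema, PigStorage returns every field as a bytearray, and the "as name:chararray" clause in FOREACH appears to only label the field with that type without converting the value, so a DataByteArray reaches code that expects a String.

    Ref: "Run a String through Java using Pig" and https://issues.apache.org/jira/browse/PIG-2315 for details.

    For now, the second approach needs an explicit cast on each field: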

    b = foreach a generate (chararray)$0 as name:chararray, (chararray)$1 as other:chararray, (int)$2 as gpa:int;
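
Put together, the second script with the explicit casts might look like the sketch below. It assumes the same input path and comma delimiter as in the question; the DESCRIBE and DUMP statements are additions for checking the result and are not part of the original scripts.

```pig
-- Load without a schema: every field arrives as a bytearray.
a = LOAD '/root/sample.txt' USING PigStorage(',');

-- Cast each positional field explicitly, then name it.
b = FOREACH a GENERATE (chararray)$0 AS name, (chararray)$1 AS other, (int)$2 AS gpa;

-- Added for illustration: inspect the schema Pig assigns to b.
DESCRIBE b;

-- Group by name and average the gpa column, as in the first script.
c = GROUP b BY name;
d = FOREACH c GENERATE group, AVG(b.gpa);

-- Added for illustration: print the per-name averages.
DUMP d;
```

With the sample data from the question, the averages should come out as 3.25 for John and 3.5 for Mary, matching the first script. DESCRIBE b should list name and other as chararray and gpa as int.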