How can correct data types on Apache Pig be enforced?

I am having trouble SUMming a bag of values because of a data type error.

When I load a CSV file whose lines look like this:

6   574 false  2010-05-16 13:56:19 +0930 304 text/css    1   /rsrc.php/zPTJC/hash/50l7x7eg.css   http    pwong

Using the following:

logs_base = FOREACH raw_logs GENERATE
  FLATTEN(
     EXTRACT(line, '^(\\d+),"(\\d+)","(\\w+)","(\\S+)","(.+?)","(\\S+)","(\\S+)","(\\d+)","(\\S+)","(\\d+)","(\\S+)","(\\S+)","(\\S+)"')
  )
  as (
    account_id: int,
    bytes: long,
    cached: chararray,
    ip: chararray,
    time: chararray,
    domain: chararray,
    host: chararray,
    status: chararray,
    mime_type: chararray,
    page_view: chararray,
    path: chararray,
    protocol: chararray,
    username: chararray
  );

All fields seem to be loaded fine, and with the right type, as shown by the "describe" command:

grunt> describe logs_base
logs_base: {account_id: int,bytes: long,cached: chararray,ip: chararray,time: chararray,domain: chararray,host: chararray,status: chararray,mime_type: chararray,page_view: chararray,path: chararray,protocol: chararray,username: chararray}

Whenever I perform a SUM using:

bytesCount = FOREACH (GROUP logs_base ALL) GENERATE SUM(logs_base.bytes);

and then STORE or DUMP the result, the MapReduce job fails with the following error:

org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing sum in Initial
    at org.apache.pig.builtin.LongSum$Initial.exec(
    at org.apache.pig.builtin.LongSum$Initial.exec(
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(
    at org.apache.hadoop.mapred.MapTask.runNewMapper(
    at org.apache.hadoop.mapred.LocalJobRunner$
Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Long
    at org.apache.pig.builtin.LongSum$Initial.exec(
    ... 15 more

The line that catches my attention is:

Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Long

This leads me to believe that the EXTRACT function is not converting the bytes field to the required data type (long).
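
For context, a `ClassCastException` like this is what the JVM raises when code casts an object to `Long` but the object is really a `String` at runtime. The sketch below reproduces that situation in plain Java. The class `DeclaredTypeDemo` and the method `sumByCast` are hypothetical names made up for illustration; this is not Pig's actual `LongSum` source.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for illustration; NOT Pig's LongSum implementation.
final class DeclaredTypeDemo {

    // Sums the values by casting each one to Long, trusting the declared type.
    static long sumByCast(List<Object> values) {
        long total = 0L;
        for (Object value : values) {
            // Throws ClassCastException when value is actually a String.
            total += (Long) value;
        }
        return total;
    }

    public static void main(String[] args) {
        // The column is "declared" long, but the runtime objects are Strings.
        List<Object> bytesColumn = Arrays.asList("574", "304");
        try {
            sumByCast(bytesColumn);
        } catch (ClassCastException e) {
            System.out.println("Cast failed: " + e.getMessage());
        }
    }
}
```

The point of the sketch is that a type label on a field does not change what object is stored in it; only an explicit conversion does.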

Is there a way to force the EXTRACT function to convert to the correct data types? How can I cast the value without having to do a FOREACH over all the records? (The same problem happens when converting the time to a Unix timestamp and attempting to find the MIN. I would definitely like a solution that does not require unnecessary projections.)

Any pointers will be appreciated. Thanks a lot for your help.

Regards, Jorge C.

P.S. I am running this in interactive mode on Amazon Elastic MapReduce.


  • Have you tried to cast the data retrieved from the UDF? Applying the schema here does not perform any casting.


    logs_base =
       FOREACH raw_logs
       GENERATE
           FLATTEN(
               (tuple(LONG,LONG,CHARARRAY,....)) EXTRACT(line, '^...')
           )
           AS (account_id: INT, ...);
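
To illustrate the difference between labelling a value with a type and actually converting it, here is a minimal Java sketch. It is only an analogy for the explicit cast suggested above: the class `ExplicitCastDemo` and the method `sumParsed` are hypothetical names, and Pig performs its own casts internally rather than calling anything like this.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of Pig or piggybank.
final class ExplicitCastDemo {

    // Converts each raw string to a long before summing. This is the analogue
    // of casting the UDF output instead of only attaching a schema to it.
    static long sumParsed(List<String> rawValues) {
        long total = 0L;
        for (String raw : rawValues) {
            total += Long.parseLong(raw.trim());
        }
        return total;
    }

    public static void main(String[] args) {
        // Values as they would come out of a regex capture group: strings.
        System.out.println(sumParsed(Arrays.asList("574", "304"))); // prints 878
    }
}
```

Once every value has been converted explicitly, the sum works on real numbers and no `ClassCastException` can occur at aggregation time.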