hadoop apache-pig udf

Presence of "in" in Pig's UDF causes problems


I was trying out my first UDF in Pig and wrote the following function:

package com.pig.in.action.assignments.udf;

import org.apache.pig.EvalFunc;
import org.apache.pig.PigWarning;
import org.apache.pig.data.Tuple;

import java.io.IOException;


public class CountLength extends EvalFunc<Integer> {

    public Integer exec(Tuple inputVal) throws IOException {

        // Validate Input Value ...
        if (inputVal == null ||
            inputVal.size() == 0 ||
            inputVal.get(0) == null) {

            // Emit warning text for user, and skip this iteration
            super.warn("Inappropriate parameter, Skipping ...",
                       PigWarning.SKIP_UDF_CALL_FOR_NULL);
            return null;
        }

        // Count # of characters in this string ...
        final String inputString = (String) inputVal.get(0);

        return inputString.length();

    }

}

However, when I try to use it as follows, Pig throws an error message that is not easy to understand, at least for me, in the context of my UDF:

grunt> cat dept.txt;
10,ACCOUNTING,NEW YORK
20,RESEARCH,DALLAS
30,SALES,CHICAGO
40,OPERATIONS,BOSTON

grunt> dept = LOAD '/user/sgn/dept.txt' USING PigStorage(',') AS (dept_no: INT, d_name: CHARARRAY, d_loc: CHARARRAY);
grunt> d = FOREACH dept GENERATE dept_no, com.pig.in.action.assignments.udf.CountLength(d_name);

2015-06-02 16:24:13,416 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 2, column 79>  mismatched input '(' expecting SEMI_COLON
Details at logfile: /home/sgn/pig_1433261973141.log

Can anyone help me figure out what's wrong with this?

I have gone through the documentation, but nothing in the sample above looks obviously wrong to me. Am I missing something here?

These are the libraries I am using in pom.xml:

<dependency>
    <groupId>org.apache.pig</groupId>
    <artifactId>pig</artifactId>
    <version>0.14.0</version>
</dependency>

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-core</artifactId>
    <version>1.2.1</version>
</dependency>

Is there a compatibility problem?

Thanks,

-Vipul Pathak;


Solution

  • Found the cause of the problem after about 36 hours of downtime.

    The package name contains a segment named "in", which somehow was a problem for Pig.

    package com.pig.in.action.assignments.udf;
    //              ^^
    

    When I changed the package name to the following, everything worked:

    package com.pig.nnn.action.assignments.udf;
    //              ^^^
    

    After rebuilding the modified UDF, I registered the JAR and defined an alias for the function name, and bingo, everything worked:

    REGISTER /user/sgn/UDFs/Pig/CountLength-1.jar;
    DEFINE  CL  com.pig.nnn.action.assignments.udf.CountLength;
    
    .   .   .
    .   .   .
    d = FOREACH dept GENERATE dept_no, CL(d_name) AS DeptLength;
    

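    IN appears to be a reserved keyword in Pig Latin (Pig has an IN operator, and keywords are case-insensitive), which would explain why a package segment named "in" trips the parser. In any case, its presence causes the problem, at least in version 0.14.0 of Pig.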
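
To catch this kind of clash before building the JAR, a fully qualified class name can be checked segment by segment against Pig keywords. The following is only a minimal sketch: the class name PigKeywordCheck and the method findKeywordSegments are hypothetical, and the keyword set is a partial, illustrative list rather than the complete list for any particular Pig version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Pig: flags package or class name segments
// that match a Pig Latin keyword.
class PigKeywordCheck {

    // Partial, illustrative keyword list (an assumption, not the full list);
    // check the Pig documentation for the version you are running.
    private static final Set<String> KEYWORDS = new HashSet<>(Arrays.asList(
        "in", "as", "by", "if", "is", "into", "load", "store", "foreach",
        "generate", "filter", "group", "join", "order", "limit", "define",
        "register", "using", "and", "or", "not"));

    // Returns every dot-separated segment that matches a keyword,
    // ignoring case because Pig keywords are case-insensitive.
    static List<String> findKeywordSegments(String qualifiedName) {
        List<String> hits = new ArrayList<>();
        for (String segment : qualifiedName.split("\\.")) {
            if (KEYWORDS.contains(segment.toLowerCase(Locale.ROOT))) {
                hits.add(segment);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // The original package name: the "in" segment is flagged.
        System.out.println(findKeywordSegments(
            "com.pig.in.action.assignments.udf.CountLength"));
        // The renamed package: nothing is flagged.
        System.out.println(findKeywordSegments(
            "com.pig.nnn.action.assignments.udf.CountLength"));
    }
}
```

Run against the two package names from this answer, the check flags only the "in" segment of the original name and nothing in the renamed one.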