EqualsIgnoreCase function - Exception : org.apache.pig.backend.executionengine.ExecException


Input :

 a.csv
 -------
  a
  A
 (blank/empty line)
  b
  B
  c
  C

Objective : To select the records which are 'a', 'A', 'b' and 'B'.

Approach 1 :

    A = LOAD 'a.csv' using PigStorage(',') AS (value:chararray);
    B = FILTER A BY LOWER(value) IN ('a','b');
    DUMP B;

    Output :
     (a)
     (A)
     (b)
     (B)

Approach 2 :

    C = FILTER A BY EqualsIgnoreCase(value, 'a') or  EqualsIgnoreCase(value, 'b');

    Output :
     2015-04-27 23:48:21,958 [Thread-30] WARN   org.apache.hadoop.mapred.LocalJobRunner - job_local_0014
        org.apache.pig.backend.executionengine.ExecException
        at org.apache.pig.builtin.EqualsIgnoreCase.exec(EqualsIgnoreCase.java:50)

I am trying to understand why this exception is thrown. I understand that it's because of the blank record.

I tried checking that value is not null or empty, but still got the same error.

  D = FILTER A BY (value IS NOT NULL) OR (TRIM(value) != '') AND (EqualsIgnoreCase(value, 'a') or  EqualsIgnoreCase(value, 'b'));

Any input or thoughts on achieving our objective using Approach 2 would be much appreciated.


Solution

  • Yes, you are right: the string functions EqualsIgnoreCase and TRIM are not able to handle the blank record in the input.
    Your last statement is almost right. Remove the TRIM call and combine the null check with "and" rather than "or", and it will work.

    C = FILTER A BY (value is not null) and (EqualsIgnoreCase(value, 'a') or  EqualsIgnoreCase(value, 'b'));
    

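    The "is not null" condition takes care of the blank record, which PigStorage loads as a null value, so the TRIM function is not required. Values made up only of spaces or tabs are not null, but they simply fail both comparisons.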
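
    If you would rather not depend on how Pig evaluates the combined condition, one possible variant is to drop the null records in a separate FILTER before calling EqualsIgnoreCase. This is only a sketch under the same assumptions as the question (the same a.csv and schema); the alias names NonNull and E are made up for illustration.

```pig
-- Load the single-column file exactly as in the question.
A = LOAD 'a.csv' USING PigStorage(',') AS (value:chararray);

-- Step 1: remove null records so the UDF never receives a null argument.
-- NonNull is a hypothetical alias name chosen for this sketch.
NonNull = FILTER A BY value IS NOT NULL;

-- Step 2: apply the case-insensitive comparison to the remaining records.
E = FILTER NonNull BY EqualsIgnoreCase(value, 'a') OR EqualsIgnoreCase(value, 'b');

DUMP E;
```

    The two-step form is a little more verbose than the single statement, but it makes the order of the null check and the comparison explicit.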