apache-pig

Selecting rows in Pig Latin


I have data that looks like this:

(a,b,c)
(a,c,b)
(a,b,d)

Is there something like DISTINCT that will produce output like the following?

(a,b,c)
(a,b,d)

I would like to ignore order and just compare elements.


Solution

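  • No. Your best option is to write a UDF that takes each row, sorts its fields, and returns them as a single ordered string, and then apply DISTINCT to that string. The script below concatenates the fields and sorts the resulting characters, so it assumes each field is a single character, as in the sample data.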

    PIG

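    REGISTER ORDER_UDF.jar;
    A = LOAD 'data.txt' USING PigStorage(',') AS (a1: chararray, a2: chararray, a3: chararray);
    -- Concatenate the fields and sort the characters so that permutations of a row become identical.
    B = FOREACH A GENERATE ORDER_UDF.ORDER(CONCAT(CONCAT(a1,a2),a3));
    C = DISTINCT B;
    -- Put the commas back between characters (single-character fields only).
    -- Matching only between two characters avoids a leading and a trailing comma.
    D = FOREACH C GENERATE REPLACE($0,'(?<=.)(?=.)',',');
    DUMP D;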
    

    UDF

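      import java.io.IOException;
      import java.util.Arrays;
      import org.apache.pig.EvalFunc;
      import org.apache.pig.data.Tuple;

      // Returns the characters of the input string in sorted order,
      // so that permutations of the same characters map to the same value.
      public class ORDER extends EvalFunc<String>
      {
        @Override
        public String exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0 || input.get(0) == null)
                return null;
            try {
                // The concatenated string is the first field of the tuple;
                // the tuple itself cannot be cast to String.
                char[] tempArray = ((String) input.get(0)).toCharArray();
                Arrays.sort(tempArray);
                return new String(tempArray);
            } catch (Exception e) {
                throw new IOException("Caught exception processing input row ", e);
            }
        }
      }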
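
Sorting characters only works while every field is a single character. If fields can be longer, one option is to sort the fields themselves and join them with a separator. The following is a minimal sketch of that idea in plain Java, independent of Pig. The class name FieldSorter and both method names are hypothetical, and it assumes that no field contains a comma.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Pig: builds an order-insensitive key for a row.
class FieldSorter {

    // Sorts a copy of the fields and joins them with commas, so that
    // ("a","c","b") and ("a","b","c") produce the same key.
    // Assumes no field contains a comma.
    static String canonicalKey(List<String> fields) {
        List<String> sorted = new ArrayList<>(fields);
        Collections.sort(sorted);
        return String.join(",", sorted);
    }

    // Keeps one key per group of rows that contain the same fields,
    // in the order the groups are first seen.
    static Set<String> distinctIgnoringOrder(List<List<String>> rows) {
        Set<String> keys = new LinkedHashSet<>();
        for (List<String> row : rows) {
            keys.add(canonicalKey(row));
        }
        return keys;
    }
}
```

Applied to the three sample rows from the question, distinctIgnoringOrder would keep the keys a,b,c and a,b,d. Inside a Pig UDF, the same sorting step could presumably be applied to the fields of the input tuple instead of a List, with DISTINCT then run on the returned string.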