I have data that looks like this:
(a,b,c)
(a,c,b)
(a,b,d)
Is there something like DISTINCT that will produce output like the following?
(a,b,c)
(a,b,d)
I would like to ignore order and just compare elements.
No. Your best option is to write a UDF that takes each row, sorts the fields, and returns them as an ordered string, and then to apply DISTINCT to that string.
Pig script:
REGISTER ORDER_UDF.jar;
A = LOAD 'data.txt' USING PigStorage(',') AS (a1: chararray, a2: chararray, a3: chararray);
B = FOREACH A GENERATE ORDER_UDF.ORDER(CONCAT(CONCAT(a1,a2),a3));
C = DISTINCT B;
D = FOREACH C GENERATE REPLACE($0, '(?<=.)(?=.)', ','); -- Re-insert a comma between adjacent characters. This only recovers the fields because each one is a single character.
DUMP D;
UDF (Java):
import java.io.IOException;
import java.util.Arrays;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class ORDER extends EvalFunc<String>
{
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null)
            return null;
        try {
            // The first field of the tuple holds the concatenated string passed in by the script.
            char[] tempArray = ((String) input.get(0)).toCharArray();
            // Sort the characters so that every permutation of a row maps to the same string.
            Arrays.sort(tempArray);
            return new String(tempArray);
        } catch (Exception e) {
            throw new IOException("Caught exception processing input row ", e);
        }
    }
}
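
Note that the UDF above sorts the characters of the concatenated string, so it only gives the right answer when every field is a single character, as in the sample data. If the fields can be longer, the same idea still works as long as whole fields are sorted rather than characters. The following is a minimal sketch of that variant in plain Java, outside Pig, just to illustrate the logic. The class and method names (RowCanonicalizer, canonical, distinctIgnoringOrder) are made up for this example, and it assumes the fields themselves never contain a comma.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Pig: puts a row into a canonical form
// by sorting its fields, so rows that differ only in order compare equal.
class RowCanonicalizer {

    // Sorts whole fields (not characters) and rejoins them with commas.
    // Assumes that no field contains a comma.
    static String canonical(String row) {
        String[] fields = row.split(",");
        Arrays.sort(fields);
        return String.join(",", fields);
    }

    // Keeps one canonical form per set of fields, in first-seen order.
    static Set<String> distinctIgnoringOrder(List<String> rows) {
        Set<String> seen = new LinkedHashSet<>();
        for (String row : rows) {
            seen.add(canonical(row));
        }
        return seen;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("a,b,c", "a,c,b", "a,b,d");
        for (String row : distinctIgnoringOrder(rows)) {
            System.out.println("(" + row + ")");
        }
    }
}
```

Inside a UDF, the body of canonical would run on the fields of the input tuple, and the Pig script would then call DISTINCT on the returned string as before. Because the delimiter is kept in the canonical string, no REPLACE step is needed afterwards.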