Basic Pig issue: subtracting one group from another


I would like to subtract one group from another in Pig. I want exactly what the "comm -23" command does in bash, but I can't find any documentation about this online.

So for example: GROUP A is: 1 2 3 4 5 6

GROUP B is: 3 4 5 6 7

And the output I need is: GROUP A - GROUP B: 1 2


Solution

  • As WinnieNicklaus suggested, DataFu is a good resource. I wrote the SetDifference UDF for exactly this use case, and it will work for you provided your data is held in bags. Note that the example sorts both bags with ORDER before passing them to the UDF.

    Example from the documentation:

    define SetDifference datafu.pig.sets.SetDifference();
    
    -- input:
    -- ({(1),(2),(3),(4),(5),(6)},{(3),(4)})
    input = LOAD 'input' AS (B1:bag{T:tuple(val:int)},B2:bag{T:tuple(val:int)});
    
    input = FOREACH input {
      B1 = ORDER B1 BY val ASC;
      B2 = ORDER B2 BY val ASC;
    
      -- output:
      -- ({(1),(2),(5),(6)})
      GENERATE SetDifference(B1,B2);
    }
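
If the two groups are separate relations rather than bags, a plain-Pig alternative (not the DataFu approach above) is to COGROUP them on the value and keep only the keys that have no match in B. The following is a minimal sketch using the question's data; the file names 'a.txt' and 'b.txt' and the single-column schema are assumptions made for illustration.

```pig
-- Hypothetical inputs, one integer per line:
--   a.txt: 1 2 3 4 5 6
--   b.txt: 3 4 5 6 7
A = LOAD 'a.txt' AS (val:int);
B = LOAD 'b.txt' AS (val:int);

-- One output row per distinct val, holding the matching rows of A and of B as bags.
grouped = COGROUP A BY val, B BY val;

-- Keep values that occur in A but have no counterpart in B.
only_in_a = FILTER grouped BY NOT IsEmpty(A) AND IsEmpty(B);

-- 'group' is the COGROUP key, i.e. the value itself.
result = FOREACH only_in_a GENERATE group AS val;

DUMP result;
```

For the data in the question this should leave the values 1 and 2, though Pig does not guarantee the order of the output rows. Unlike "comm -23", which compares sorted lines one by one, this returns each distinct value only once, so duplicates within A are collapsed.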