Search code examples
filternlpapache-pigtokenapache-datafu

Pig: Filter multi-column table on Set


I have the following inputs:

input = LOAD '$in_data' USING PigStorage('\t', '-schmea') AS (
   uid:chararray,
   pid:int,
   token:chararray
);
stpwrd = LOAD '$stpwrd' USING PigStorage('\t', '-schema') AS (
   token:chararray
);

My goal can be summarized in the following pseudo-code:

output = FILTER input BY NOT IN(input.token, stpwrd);

, which ideally gives rows in the input table whose input.token field is not in stpwrd.

I checked the SetDifference() UDF in datafu (link), but I am not sure if that will do the job, since it seems to require both table to be singleton, while my input table has multiple columns.


Solution

  • We can achieve this objective using RIGHT join and filtering those records which are there in stpwrd, an example below illustrates the usage.

    Input : input_data

    uid1    1   token1
    uid2    2   token2
    uid3    3   token3
    

    Input : stpwrd

    token1
    token2
    

    Pig Script :

        input_data = LOAD 'input_data' USING PigStorage('\t') AS (
       uid:chararray,
       pid:int,
       token:chararray
    );
    
    stpwrd = LOAD 'stpwrd' USING PigStorage('\t') AS (
       token:chararray
    );
    
    output_data = JOIN stpwrd BY token RIGHT, input_data BY token;
    
    req_data = FILTER output_data BY stpwrd::token IS NULL;
    

    Output : req_data

    (,uid3,3,token3)
    

    Project required fields from req_data alias.