I have the following inputs:
input = LOAD '$in_data' USING PigStorage('\t', '-schmea') AS (
uid:chararray,
pid:int,
token:chararray
);
stpwrd = LOAD '$stpwrd' USING PigStorage('\t', '-schema') AS (
token:chararray
);
My goal can be summarized in the following pseudo-code:
output = FILTER input BY NOT IN(input.token, stpwrd);
, which ideally gives rows in the input
table whose input.token
field is not in stpwrd
.
I checked the SetDifference()
UDF in datafu
(link), but I am not sure if that will do the job, since it seems to require both table to be singleton, while my input
table has multiple columns.
We can achieve this objective using RIGHT join and filtering those records which are there in stpwrd, an example below illustrates the usage.
Input : input_data
uid1 1 token1
uid2 2 token2
uid3 3 token3
Input : stpwrd
token1
token2
Pig Script :
input_data = LOAD 'input_data' USING PigStorage('\t') AS (
uid:chararray,
pid:int,
token:chararray
);
stpwrd = LOAD 'stpwrd' USING PigStorage('\t') AS (
token:chararray
);
output_data = JOIN stpwrd BY token RIGHT, input_data BY token;
req_data = FILTER output_data BY stpwrd::token IS NULL;
Output : req_data
(,uid3,3,token3)
Project required fields from req_data alias.