hadoop, mapreduce, apache-pig

NOT IN function in Pig


I'm trying to find the differences between two tables (Source and Destination) using the DIFF() function in Pig. To do that, I have:

sourcenew = LOAD 'hdfs://HADOOPMASTER:54310/DVTTest/Source.txt' USING PigStorage(',') AS (ID:chararray, Name:chararray, FirstName:chararray, LastName:chararray, Vertical_Name:chararray, Vertical_ID:chararray, Gender:chararray, DOB:chararray, Degree_Percentage:chararray, Salary:chararray, StateName:chararray);

destnew = LOAD 'hdfs://HADOOPMASTER:54310/DVTTest/Destination.txt' USING PigStorage(',') AS (ID:chararray, Name:chararray, FirstName:chararray, LastName:chararray, Vertical_Name:chararray, Vertical_ID:chararray, Gender:chararray, DOB:chararray, Degree_Percentage:chararray, Salary:chararray, StateName:chararray);

cogroupnew= COGROUP sourcenew by ID inner, destnew by ID inner;

diffnew = FOREACH cogroupnew GENERATE DIFF(sourcenew,destnew);

DUMP diffnew;

This gives the differences between the two tables, or returns an empty bag {} when the tuples match. It works fine up to this point. My next step is to find the extra records in the source file that are not in the destination. For that, I tried:

cogroupextrainsource= COGROUP sourcenew by ID inner, destnew by ID;
filterextrainsource= FILTER cogroupextrainsource BY ID NOT (cogroupnew)

It throws an error, as expected. I need help finding the extra records in the source. Any help would be much appreciated.

Thank you!


Solution

  • You do not need the $ sign next to the column name ID. The $ sign is used only when you want to access a column by position (for example $0) rather than by name.

    cogroupextrainsource = COGROUP sourcenew by ID inner, destnew by ID;
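
One way to finish from the COGROUP above, as a minimal sketch rather than a definitive answer: because only sourcenew is marked inner, every source ID keeps a group, and the destnew bag is empty wherever the destination has no matching ID. Filtering on that empty bag with the built-in IsEmpty() should leave only the extra source records. The alias extrainsource is a hypothetical name introduced for illustration; filterextrainsource is reused from the question.

```pig
-- Keep only the groups whose destination bag is empty,
-- i.e. IDs present in the source but missing from the destination.
filterextrainsource = FILTER cogroupextrainsource BY IsEmpty(destnew);

-- Flatten the source bag so the output is plain source records again.
-- (extrainsource is a hypothetical alias, not from the original post.)
extrainsource = FOREACH filterextrainsource GENERATE FLATTEN(sourcenew);

DUMP extrainsource;
```

Note that this compares on ID only. A record whose ID exists in both files but whose other fields differ would not show up here; the DIFF() step from the question covers that case.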