hadoop, apache-pig

Creating a massive FILTER BY in Pig


I have this code.

large = LOAD 'a super large file';

CC = FILTER large BY $19 == 'abc' OR $20 == 'abc'
OR $19 == 'def' OR $20 == 'def' ....;

The number of OR conditions can run into the hundreds or even thousands.

Is there a better way to do this?


Solution

  • Yes. Put the values you are comparing against in a separate file, load it into a relation, and join the two relations on the column. The join keeps only the rows of the large relation whose column value appears in the filter file. If you have to filter on multiple columns, create one filter file per column. Below is an example for two columns:

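    large = LOAD 'a super large file';
    filter1 = LOAD 'file with values needed to compare with $19';
    filter2 = LOAD 'file with values needed to compare with $20';
    -- Each inner join keeps the rows of large whose key appears in the filter file;
    -- the matching filter column is appended to each output row.
    f1 = JOIN large BY $19, filter1 BY $0;
    f2 = JOIN large BY $20, filter2 BY $0;
    -- UNION does not remove duplicates: a row that matches on both columns appears twice.
    final = UNION f1, f2;
    DUMP final;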
    

    Alternatively, you can probably use a single filter file with one column for each column you filter on, join on each of those columns separately to get the filtered results, and then union the relations:

    large = LOAD 'a super large file';
    filter_file = LOAD 'file with values in different columns';
    
    f1 = JOIN large BY $19, filter_file BY $0;
    f2 = JOIN large BY $20, filter_file BY $1;
    final = UNION f1, f2;
    DUMP final;
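
    Since the value lists here are small (hundreds or thousands of entries) compared with the large file, a fragment-replicate join may be worth trying. It asks Pig to hold the smaller relation in memory and do the join on the map side, so the small relation has to be listed last. The following is only a sketch that reuses the placeholder paths from the first example; whether it is actually faster depends on your data and cluster, and it assumes the filter files fit in memory.

    ```pig
    large = LOAD 'a super large file';
    filter1 = LOAD 'file with values needed to compare with $19';
    filter2 = LOAD 'file with values needed to compare with $20';

    -- USING 'replicated' requests a map-side join; the relation listed last
    -- (the small filter relation) is the one loaded into memory.
    f1 = JOIN large BY $19, filter1 BY $0 USING 'replicated';
    f2 = JOIN large BY $20, filter2 BY $0 USING 'replicated';

    -- As before, rows that match on both columns show up in both f1 and f2.
    final = UNION f1, f2;
    DUMP final;
    ```

    If the filter relations turn out to be too large for memory, the plain JOIN shown earlier still works; only the USING clause changes.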