Search code examples
apache-pig

Filter data after join using PIG


I would like to filter the records after two files are joined.

The file BX-Books.csv contains the book data. and the file BX-Book-Ratings.csv contains books rating data where ISBN is the common column from both the files. The inner join betweeb the files is done using the this column.
I would like to get the books that are published in the year 2002.

I have used the following script but i am getting 0 records.

grunt> BookXRecords = LOAD '/user/pradeep/BX-Books.csv'  USING PigStorage(';') AS (ISBN:chararray,BookTitle:chararray,BookAuthor:chararray,YearOfPublication:chararray, Publisher:chararray,ImageURLS:chararray,ImageURLM:chararray,ImageURLL:chararray);
grunt> BookXRating = LOAD '/user/pradeep/BX-Book-Ratings.csv'  USING PigStorage(';') AS (user:chararray,ISBN:chararray,rating:chararray);
grunt> BxJoin = JOIN BookXRecords BY ISBN, BookXRating BY ISBN;
grunt> BxJoin_Mod = FOREACH BxJoin GENERATE $0 AS ISBN, $1, $2, $3, $4;
grunt> FLTRBx2002 = FILTER BxJoin_Mod BY $3 == '2002';

Solution

  • I created a test.csv and test-rating.csv and a Pig script that works out of them. It worked perfectly fine.

    test.csv

    1;abc;author1;2002
    2;xyz;author2;2003
    

    test-rating.csv

    user1;1;3
    user2;2;5
    

    Pig Script :

    A = LOAD 'test.csv' USING PigStorage(';') AS (ISBN:chararray,BookTitle:chararray,BookAuthor:chararray,YearOfPublication:chararray);
    describe A;
    dump A;
    
    B = LOAD 'test-rating.csv' USING PigStorage(';') AS (user:chararray,ISBN:chararray,rating:chararray);
    describe B;
    dump B;
    
    C = JOIN A BY ISBN, B BY ISBN;
    describe C;
    dump C;
    
    D = FOREACH C GENERATE $0 as ISBN,$1,$2,$3;
    describe D;
    dump D;
    
    E = FILTER D BY $3 == '2002';
    describe E;
    dump E;
    

    Output:

    A: {ISBN: chararray,BookTitle: chararray,BookAuthor: chararray,YearOfPublication: chararray}
    (1,abc,author1,2002)
    (2,xyz,author2,2003)
    B: {user: chararray,ISBN: chararray,rating: chararray}
    (user1,1,3)
    (user2,2,5)
    C: {A::ISBN: chararray,A::BookTitle: chararray,A::BookAuthor: chararray,A::YearOfPublication: chararray,B::user: chararray,B::ISBN: chararray,B::rating: chararray}
    (1,abc,author1,2002,user1,1,3)
    (2,xyz,author2,2003,user2,2,5)
    D: {ISBN: chararray,A::BookTitle: chararray,A::BookAuthor: chararray,A::YearOfPublication: chararray}
    (1,abc,author1,2002)
    (2,xyz,author2,2003)
    E: {ISBN: chararray,A::BookTitle: chararray,A::BookAuthor: chararray,A::YearOfPublication: chararray}
    (1,abc,author1,2002)