hadoop bigdata apache-pig

How to remove duplicates on a column basis in Pig


Can anyone help me remove the old records from my CSV file and keep only the most recent record for each key, using Pig?

Example input:

Key1 sta DATE

XXXXX P38 17-10-2017

XXXXX P38 12-10-2017

YYYYY P38 11-10-2017

YYYYY P38 23-09-2017

YYYYY P38 14-09-2017

ZZZZZ P38 25-10-2017

ZZZZZ P38 10-10-2017

My expected output would be:

Key1 sta DATE

XXXXX P38 17-10-2017

YYYYY P38 11-10-2017

ZZZZZ P38 25-10-2017

The header should also be included in the output.

Please suggest how I can achieve this.


Solution

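  • A nested FOREACH can be used for this case. The dates are in dd-MM-yyyy format, so they are parsed with ToDate before sorting; ordering the raw strings would compare them lexicographically and, for YYYYY, pick 23-09-2017 instead of 11-10-2017.

    -- '....' is the input path (elided in the original); add USING PigStorage(',') for comma-delimited data.
    A = LOAD '....' AS (key1:chararray, sta:chararray, date:chararray);
    -- Drop the header row, then add a parsed datetime column to sort on.
    A1 = FILTER A BY key1 != 'Key1';
    A2 = FOREACH A1 GENERATE key1, sta, date, ToDate(date, 'dd-MM-yyyy') AS dt;
    B =
        FOREACH (GROUP A2 BY key1) {
            orderd = ORDER A2 BY dt DESC;   -- newest first
            ltsrow = LIMIT orderd 1;        -- keep only the latest row per key
            GENERATE FLATTEN(ltsrow.(key1, sta, date));
        };
    STORE B into 'output' using PigStorage('\t', '-schema');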
    

    To learn more about nested FOREACH, see https://shrikantbang.wordpress.com/2014/01/14/apache-pig-group-by-nested-foreach-join-example/ and https://community.mapr.com/thread/22034-apache-pig-nested-foreach-explaination

    For saving output with a schema, see https://hadoopified.wordpress.com/2012/04/22/pigstorage-options-schema-and-source-tagging/. With the '-schema' option, PigStorage writes the column names to a hidden .pig_header file in the output directory rather than into the data files themselves.
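
    If the header line has to appear inside the output file itself, one possible variant is to use piggybank's CSVExcelStorage, which can skip a header on load and write one on store. The following is only a sketch under several assumptions: that piggybank.jar is available and recent enough to support the header options, that the input is comma-delimited, and that the paths '/path/to/piggybank.jar', 'input.csv' and 'output_with_header' (all hypothetical) are replaced with real ones. The header written on store comes from the field names in the final relation, so it would read key1,sta,date here.

    ```pig
    -- Hypothetical jar location; adjust to the local installation.
    REGISTER '/path/to/piggybank.jar';

    -- Assumed comma-delimited input; SKIP_INPUT_HEADER drops the header line on load.
    raw = LOAD 'input.csv'
        USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER')
        AS (key1:chararray, sta:chararray, date:chararray);

    -- Parse dd-MM-yyyy into a datetime so that sorting is chronological.
    parsed = FOREACH raw GENERATE key1, sta, date, ToDate(date, 'dd-MM-yyyy') AS dt;

    -- Per key: sort newest first, keep one row, and project the original columns.
    latest =
        FOREACH (GROUP parsed BY key1) {
            sorted = ORDER parsed BY dt DESC;
            newest = LIMIT sorted 1;
            GENERATE FLATTEN(newest.(key1, sta, date)) AS (key1, sta, date);
        };

    -- WRITE_OUTPUT_HEADER emits the field names as the first line of the output.
    STORE latest INTO 'output_with_header'
        USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'WRITE_OUTPUT_HEADER');
    ```

    With more than one reducer, each part file may get its own header line, so it is worth checking the result before merging the parts.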