
Generate a surrogate key in Pig using a custom rank


I will be running a Pig transformation daily (new data arrives every day), and I need to generate a unique key for the data pulled each day. What would be the best approach? If I use RANK, will tomorrow's ranks overwrite today's ranks?


Solution

  • RANK starts at 1 each time you kick it off, so the same rank values reappear every day and the rank alone is not unique across runs. If you want keys that are unique per day, I would recommend applying the DataFu SHA hash to CONCAT(rank, date). As long as each day's rows carry a different date, you'll end up with a hash that can be used as a surrogate key.

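    -- Register DataFu and define its SHA hashing UDF.
    REGISTER datafu-1.2.0.jar;
    DEFINE SHA datafu.pig.hash.SHA();

    S1 = LOAD 'surrogate_hash' USING PigStorage('|') AS (c1:chararray, date:chararray, c3:chararray);
    -- RANK prepends a long column named rank_S1 that starts at 1 on every run.
    S2 = RANK S1;
    -- Hash rank + date so that equal ranks from different days produce different keys.
    S3 = FOREACH S2 GENERATE SHA(CONCAT((chararray)rank_S1, date)) AS surrogate_key, c1, date, c3;

    DUMP S3;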
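
    For a daily job, the same idea could be sketched as follows. This is only a sketch under a few assumptions that are not in the original answer: RUN_DATE is a hypothetical parameter name passed on the command line, 'surrogate_keys/...' is a hypothetical output path, and the '|' separator between rank and date is added so that the concatenated string cannot be read two ways (for example rank 1 followed by a date starting with 1, versus rank 11).

    ```pig
    -- Run as: pig -param RUN_DATE=2024-01-15 surrogate_key.pig
    -- (RUN_DATE and the script name are hypothetical.)
    REGISTER datafu-1.2.0.jar;
    DEFINE SHA datafu.pig.hash.SHA();

    S1 = LOAD 'surrogate_hash' USING PigStorage('|') AS (c1:chararray, date:chararray, c3:chararray);
    S2 = RANK S1;

    -- Nested CONCAT builds "<rank>|<date>" before hashing.
    S3 = FOREACH S2 GENERATE
             SHA(CONCAT(CONCAT((chararray)rank_S1, '|'), date)) AS surrogate_key,
             c1, date, c3;

    -- One output directory per run, so each day's keys are kept separately.
    STORE S3 INTO 'surrogate_keys/$RUN_DATE' USING PigStorage('|');
    ```

    Since STORE fails when the output location already exists, writing to a per-day directory also means that a later run cannot silently replace an earlier day's output.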