
Use Pig to Denormalize A Large Data Frame


I have a large-ish (21GB) tab-delimited data frame of the form

DOCID_1    TERMID_1    TITLE_1    YEAR_1    AUTHOR_1
DOCID_1    TERMID_2    TITLE_1    YEAR_1    AUTHOR_1
...
DOCID_n    TERMID_n    TITLE_n    YEAR_n    AUTHOR_n

That is, a (DOCID, TERMID) pair always uniquely identifies a row. What I need is a data frame in which a DOCID alone uniquely identifies a row, and the TERMIDs are collapsed into a comma-separated chararray list. For example,

DOCID_1    TERMID_11, TERMID_12, ..., TERMID_n    TITLE_1    YEAR_1    AUTHOR_1
...
DOCID_n    TERMID_n1, TERMID_n2, ..., TERMID_n    TITLE_n    YEAR_n    AUTHOR_n

Can anyone think of a good way of doing this in Pig?


Solution

  • SEMINORMALIZED = LOAD 'so.txt' USING PigStorage('\t') AS (  -- the question's input is tab-delimited
        doc_id:chararray
        ,term_id:chararray
        ,title:chararray
        ,year:chararray
        ,author:chararray
    );
    
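    -- One (doc_id, term_id) pair per input row.
    KEYS = FOREACH SEMINORMALIZED GENERATE 
        doc_id
        ,term_id
    ;

    -- Per-document attributes, repeated once per term in the input.
    ATTRIBUTES = FOREACH SEMINORMALIZED GENERATE
        doc_id
        ,title
        ,year
        ,author
    ;

    -- Collapse the repeated attribute rows.
    ATTRIBUTES = DISTINCT ATTRIBUTES;

    -- Collect each document's term_ids into a bag.
    GROUPED = GROUP KEYS BY doc_id;

    ZNF = FOREACH GROUPED GENERATE
        group AS doc_id
        ,KEYS.term_id AS term_ids
    ;

    -- Reattach the attributes to the grouped terms.
    DENORMALIZED = JOIN ZNF BY doc_id, ATTRIBUTES BY doc_id;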
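
The script above leaves the term IDs in a bag, whereas the question asks for a comma-separated chararray. One possible way to close that gap is sketched below, assuming your Pig version ships the built-in BagToString function. It continues from the GROUPED and ATTRIBUTES relations in the solution; the names ZNF_STR, JOINED, FLAT and the output path 'denormalized_out' are made up for illustration.

```pig
-- Sketch only: continues from the GROUPED and ATTRIBUTES relations defined in
-- the solution. ZNF_STR, JOINED, FLAT and 'denormalized_out' are illustrative names.

-- BagToString joins the bag's values into one chararray using the given delimiter.
ZNF_STR = FOREACH GROUPED GENERATE
    group AS doc_id
    ,BagToString(KEYS.term_id, ',') AS term_ids
;

JOINED = JOIN ZNF_STR BY doc_id, ATTRIBUTES BY doc_id;

-- The join keeps both doc_id columns; project one and drop the relation prefixes.
FLAT = FOREACH JOINED GENERATE
    ZNF_STR::doc_id AS doc_id
    ,ZNF_STR::term_ids AS term_ids
    ,ATTRIBUTES::title AS title
    ,ATTRIBUTES::year AS year
    ,ATTRIBUTES::author AS author
;

STORE FLAT INTO 'denormalized_out' USING PigStorage('\t');
```

Note that a bag is unordered, so the term IDs within each list may come out in any order. If the order matters, a nested ORDER inside the FOREACH before calling BagToString would likely be needed.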