How to flatten recursive hierarchy using Hive/Pig/MapReduce


I have unbalanced tree data stored in tabular format like:

parent,child
a,b
b,c
c,d
c,f
f,g

The depth of the tree is unknown.

How can I flatten this hierarchy so that each row contains the entire path from a leaf node to the root node, like this:

leaf node, root node, intermediate nodes
d,a,d:c:b
g,a,g:f:c:b

Any suggestions for solving this with Hive, Pig, or MapReduce? Thanks in advance.
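
To pin down the expected output, here is a minimal in-memory sketch in plain Python (not a Hadoop job). It assumes each child has exactly one parent, there are no cycles, and a leaf is any node that never appears in the parent column (d and g in the sample). The function name `flatten_paths` is made up for illustration.

```python
# Local reference only: follows child -> parent links in memory.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("c", "f"), ("f", "g")]

def flatten_paths(edges):
    """Return (leaf, root, path) rows; path runs from the leaf up to, but excluding, the root."""
    parent_of = {child: parent for parent, child in edges}
    parents = {parent for parent, _ in edges}
    # Assumption: a leaf is a child that never appears as a parent.
    leaves = sorted(child for child in parent_of if child not in parents)
    rows = []
    for leaf in leaves:
        path = [leaf]
        node = parent_of[leaf]
        # Climb until we reach a node that has no parent (the root).
        while node in parent_of:
            path.append(node)
            node = parent_of[node]
        rows.append((leaf, node, ":".join(path)))
    return rows

for leaf, root, path in flatten_paths(edges):
    print(f"{leaf},{root},{path}")
```

A small reference like this can be handy for checking the output of a Hive, Pig, or MapReduce solution on a sample of the data.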


Solution

  • I tried to solve it using Pig; here is the sample code:

    Join function:

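    -- Join each row's current parent against the source to find the next ancestor.
    -- Return alias renamed from output, which appears in Pig's reserved keyword list.
    define join_hierarchy (leftA, source, result) returns next_level {
        joined = join $leftA by parent left outer, $source by child;
        -- No match: the current parent is the root, so the path is complete.
        tmp_filtered = filter joined by $source::parent is null;
        part = foreach tmp_filtered generate $leftA::child as child, $leftA::path as path;
        $result = union part, $result;
        -- Match: move one level up and prepend the new ancestor to the path.
        part_remaining = filter joined by $source::parent is not null;
        $next_level = foreach part_remaining generate $leftA::child as child, $source::parent as parent, CONCAT(CONCAT($source::parent, ':'), $leftA::path) as path;
    };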
    

    Load dataset:

    -- My dataset's field delimiter is ','.
    source = load '*****' using PigStorage(',') as (parent:chararray, child:chararray);
    -- Create an additional column for the path.
    leftA = foreach source generate child, parent, CONCAT(parent, ':') as path;

    -- Initially the result table is blank.
    result = limit leftA 1;
    result = foreach result generate '' as child, '' as path;
    -- Flatten the hierarchy to 4 levels. Add one call below per level of hierarchy depth.
    
    leftA= join_hierarchy(leftA, source, result);
    leftA= join_hierarchy(leftA, source, result);
    leftA= join_hierarchy(leftA, source, result);
    leftA= join_hierarchy(leftA, source, result);
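
Since the depth is unknown, the fixed number of calls above has to be at least as large as the deepest path. The sketch below imitates the same repeated left join in plain Python, looping until no unfinished rows remain instead of hard-coding the number of rounds. It is only an illustration of the logic, not Pig code: `flatten_by_joins` is a made-up name, it starts from leaf rows only (the Pig script starts from every row), and it assumes one parent per child and no cycles.

```python
# Plain-Python imitation of the repeated left join; all names are illustrative.
def flatten_by_joins(edges):
    parent_of = {child: parent for parent, child in edges}  # lookup side of the join
    parents = {parent for parent, _ in edges}
    # Start from leaf rows only: (leaf, current ancestor, path so far).
    frontier = [(c, p, c) for p, c in edges if c not in parents]
    result = []
    rounds = 0
    while frontier:
        rounds += 1
        next_frontier = []
        for leaf, ancestor, path in frontier:
            grandparent = parent_of.get(ancestor)  # None plays the role of an unmatched left join
            if grandparent is None:
                # The current ancestor has no parent, so it is the root: the row is finished.
                result.append((leaf, ancestor, path))
            else:
                # Move one level up and extend the path with the ancestor just passed.
                next_frontier.append((leaf, grandparent, path + ":" + ancestor))
        frontier = next_frontier
    return sorted(result), rounds

rows, rounds = flatten_by_joins(
    [("a", "b"), ("b", "c"), ("c", "d"), ("c", "f"), ("f", "g")]
)
for leaf, root, path in rows:
    print(f"{leaf},{root},{path}")
print("join rounds needed:", rounds)
```

The loop ends when a round produces no unfinished rows, so the round count adapts to the data. A Pig script on its own has no such loop, so an outer driver would presumably have to repeat the macro call and stop once the remaining relation is empty.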