hbase, apache-pig

How to join maps in Apache Pig? (stored in HBase)


I have a problem with Apache Pig and do not know how to solve it, or whether it is possible at all. I'm working with HBase as the "storage layer". The table looks like this:

row key/column  (b1, c1)        (b2, c2)    ...     (bn, cn)
a1              empty           empty               empty   
a2              ...
an              ...         

The row keys run from a1 to an, and each row has its own set of columns named in the form (bn, cn). The value of every row/column cell is empty.

My Pig program looks like this:

/* Loading the data */
mydata = load 'hbase://mytable' ... as (a:chararray, b_c:map[]);

/* finding the right elements */ 
sub1 = FILTER mydata BY a == 'a1';
sub2 = FILTER mydata BY a == 'a2';

Now I want to join sub1 and sub2, meaning I want to find the columns that exist in both sub1 and sub2. How can I do this?


Solution

  • Pure Pig has no way to do this kind of operation on maps, so you will need a UDF. I'm not sure exactly what output you want from the join, but it should be fairly easy to tweak the Python UDF to your needs.

    myudf.py

    @outputSchema('cols: {(col:chararray)}')
    def join_maps(M1, M2):
        # Return the column names that exist, with non-null values, in both maps.
        out = []
        for k, v in M1.iteritems():
            if k in M2 and v is not None and M2[k] is not None:
                # A Pig bag holds tuples, so wrap each key in a one-element tuple.
                out.append((k,))
        return out
    

    You can use it like this:

    register 'myudf.py' using jython as myudf ;
    
    -- sub2 can be referenced as a scalar inside the FOREACH over sub1 because it has only one row
    D = FOREACH sub1 GENERATE myudf.join_maps(b_c, sub2.b_c) ;
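
To check the intersection logic without a Pig or HBase setup, it can be run in plain Python first. This is only a minimal sketch: the function name `common_columns`, the sample rows, and the `b1:c1` key format are made up for illustration, and it assumes each Pig map reaches the UDF as a dict keyed by column name.

```python
def common_columns(m1, m2):
    """Return the sorted column names present, with non-None values, in both maps."""
    # Treat a missing (None) or empty map as having no columns.
    if not m1 or not m2:
        return []
    return sorted(k for k, v in m1.items()
                  if v is not None and m2.get(k) is not None)


# Hypothetical rows: keys stand in for HBase column names, values for the empty cells.
row_a1 = {"b1:c1": "", "b2:c2": "", "b3:c3": ""}
row_a2 = {"b2:c2": "", "b3:c3": "", "b4:c4": ""}

shared = common_columns(row_a1, row_a2)
print(shared)
```

An empty string is not `None`, so empty cells still count as present in this sketch. If empty values turn out to arrive as null in your setup, drop the `is not None` checks and compare keys only.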