
Apache Pig: Dynamic columns


I have a CSV dataset with three value columns (v1, v2 and v3). The description of each value is stored as a comma-separated string in the column 'keys'.

| v1 | v2 | v3 | keys  |
| A  | C  | E  | X,Y,Z |

Using Pig, I would like to load this information into an HBase table where the column family is C and the column qualifier is the key.

| C:X | C:Y | C:Z |
| A   | C   | E   |

Has anyone done this before and would be willing to share how?

Another option is to store a map (key#value) in an HBase column, but I'm not sure whether that is flexible enough for querying the data.


Solution

  • Found a solution to my problem

    test.pig:

    REGISTER 'data.py' USING jython AS myfuncs;
    
    A = LOAD 'data' using PigStorage('|') AS (
        id:chararray,
        date:chararray,
        v1:chararray,
        v2:chararray,
        v3:chararray,
        keys:chararray
    );
    
    B = FOREACH A {
    GENERATE
        id,
        date,
        myfuncs.dataToMap(STRSPLIT(keys, ','), TOTUPLE(v1, v2, v3)) as kv;
    }
    
    STORE B INTO 'pig_table' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage( 'e:date kv:*' );
    

    data.py:

    import org.apache.pig.data.DataType as DataType
    import org.apache.pig.impl.logicalLayer.schema.SchemaUtil as SchemaUtil
    
    @outputSchema("ud:map[]")
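    def dataToMap(keys, values):
        # keys: tuple produced by STRSPLIT(keys, ','); values: TOTUPLE(v1, v2, v3)
        result = dict()
        keys = list(keys)
        values = list(values)

        # Drop every null value before pairing
        try:
            while True:
                values.remove(None)
        except ValueError:
            pass

        # Pair keys and values by position; assumes there are
        # at least as many remaining values as keys
        for idx in range(len(keys)):
            result[keys[idx]] = values[idx]

        return result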
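
The pairing logic in data.py can be tried outside Pig with plain Python. The sketch below is not the UDF itself: `data_to_map` is a hypothetical stand-alone name, and it handles nulls differently from the UDF. It skips a null value together with its key instead of removing nulls first, on the assumption that keys and values are meant to line up by position.

```python
def data_to_map(keys, values):
    """Pair each key with the value at the same position, skipping nulls."""
    result = {}
    for key, value in zip(keys, values):
        # Assumption: a null value means "no cell for this qualifier"
        if value is not None:
            result[key] = value
    return result


# Sample row from the question: values A, C, E and keys "X,Y,Z"
row_keys = "X,Y,Z".split(",")
row_values = ("A", "C", "E")
print(data_to_map(row_keys, row_values))
```

Each entry of the resulting map corresponds to one column qualifier in the 'kv' column family when the relation is stored with 'kv:*'.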