I have a dataset (CSV) with three value columns (v1, v2 and v3). The description of each value is stored as a comma-separated string in the 'keys' column, in the same order as the values.
| v1 | v2 | v3 | keys |
| A | C | E | X,Y,Z |
Using Pig, I would like to load this data into an HBase table where the column family is C and the column qualifier is the corresponding key:
| C:X | C:Y | C:Z |
| A | C | E |
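To make the target layout concrete, here is a minimal plain-Python sketch (not Pig) of the mapping I have in mind. The helper name to_qualified_columns is hypothetical and only for illustration; the column family defaults to C as described above.

```python
def to_qualified_columns(values, keys, family="C"):
    """Pair each value with its key and prefix the key with the column family."""
    # 'keys' is the comma-separated string from the keys column
    qualifiers = [k.strip() for k in keys.split(",")]
    # zip pairs the i-th qualifier with the i-th value
    return dict(("%s:%s" % (family, q), v) for q, v in zip(qualifiers, values))


row = to_qualified_columns(["A", "C", "E"], "X,Y,Z")
print(row)
```

For the example row this pairs X with A, Y with C and Z with E, each under the C family.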
Has anyone done this before and would be willing to share how?
Another option is to store a map (key#value) in a single HBase column, but I'm not sure how flexible that would be for querying the data.
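I found a solution to my problem: a Jython UDF builds a map from the keys and values, and HBaseStorage writes each map entry as its own column qualifier.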
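test.pig:

    -- Register the Jython UDFs from data.py under the myfuncs namespace
    REGISTER 'data.py' USING jython AS myfuncs;

    A = LOAD 'data' USING PigStorage('|') AS (
        id:chararray,
        date:chararray,
        v1:chararray,
        v2:chararray,
        v3:chararray,
        keys:chararray
    );

    -- Split the keys string and pair each key with its value in a map
    B = FOREACH A {
        GENERATE
            id,
            date,
            myfuncs.dataToMap(STRSPLIT(keys, ','), TOTUPLE(v1, v2, v3)) AS kv;
    }

    -- The first field (id) is the row key; each map entry is stored as kv:<key>
    STORE B INTO 'pig_table' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('e:date kv:*');
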
data.py:

    import org.apache.pig.data.DataType as DataType
    import org.apache.pig.impl.logicalLayer.schema.SchemaUtil as SchemaUtil

    @outputSchema("ud:map[]")
    def dataToMap(keys, values):
        # keys and values arrive as Pig tuples; convert them to lists
        keys = list(keys)
        # Drop null values so the remaining ones line up with the keys
        values = [v for v in values if v is not None]
        result = dict()
        for idx in range(len(keys)):
            result[keys[idx]] = values[idx]
        return result
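
To sanity-check the UDF logic outside Pig, something like the following should work in plain Python. The outputSchema function below is a stand-in I am assuming for local testing only; inside Pig the real decorator is supplied when the script is registered with jython. The sample tuples are made up for illustration.

```python
def outputSchema(schema):
    """Local stand-in for Pig's decorator: records the schema, returns the function unchanged."""
    def wrap(func):
        func.outputSchema = schema
        return func
    return wrap


@outputSchema("ud:map[]")
def dataToMap(keys, values):
    keys = list(keys)
    # Drop null values so the remaining ones line up with the keys
    values = [v for v in values if v is not None]
    result = dict()
    for idx in range(len(keys)):
        result[keys[idx]] = values[idx]
    return result


# A full row, and a row where v3 is null and only two keys are listed
full = dataToMap(("X", "Y", "Z"), ("A", "C", "E"))
partial = dataToMap(("X", "Y"), ("A", "C", None))
print(full)
print(partial)
```

Note that a row with more keys than non-null values makes the lookup values[idx] raise an IndexError, so the input needs to keep the two in step.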