
Using a UDF on an Avro file in a Pig script


I'm importing an Avro file from HDFS into HBase using Pig, but I have to apply a user-defined function (UDF) to the row id. I'm using the SHA function from Apache DataFU:

register datafu-pig-incubating-1.3.0.jar
define SHA datafu.pig.hash.SHA();

set hbase.zookeeper.quorum 'localhost';
set mapreduce.fileoutputcommitter.marksuccessfuljobs 'false';

avro = LOAD '/user/myuser/avro/' USING AvroStorage();
partitioned = FOREACH avro GENERATE SHA(ROW_ID) as key,VALUE_1,VALUE_2;

STORE partitioned INTO 'hbase://MYTABLE' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:value_1 cf:value_2');

I've tried the sample scripts from the DataFU website and they complete successfully. If I remove the SHA() call from my script, it also completes. So what am I missing?


Solution

  • Never mind, it's my own fault. The SHA() call expects a string parameter, but ROW_ID is defined as a long. I added a cast to chararray for ROW_ID and it works now.

    There was no error in the logs when I ran the script as part of an Oozie workflow, but when I entered it line by line into the Grunt shell, I got an error message after the "partitioned = " line.

    For anyone experiencing problems with UDFs, I'd recommend entering the script line by line in the shell first.
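
    For reference, a sketch of the corrected FOREACH line with the cast applied (field names taken from the question's script):

    ```pig
    -- Cast ROW_ID to chararray so it matches SHA()'s expected string parameter
    partitioned = FOREACH avro GENERATE SHA((chararray)ROW_ID) as key,VALUE_1,VALUE_2;
    ```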