Tags: hadoop, apache-spark, hbase, protocol-buffers, hfile

Any ideas on how to bulk load protocol buffer files via HFile into HBase?


Here's what I'm trying to do:

Load data from Hive into HBase, serialized as protocol buffers.

I've tried multiple ways:

  1. Creating connections directly to HBase and doing Puts. This works, but it's apparently not very efficient (a minimal sketch follows this list).

  2. Exporting the table from Hive to S3 as tab-separated text files, then using the ImportTsv utility to generate HFiles and bulk load them into HBase. This also works.
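
For context, approach 1 looks roughly like the sketch below (HBase 1.x client API; the table name `my_table`, column family `cf`, qualifier `proto`, row key, and value bytes are all hypothetical stand-ins). Each Put travels the full write path on the region server, which is why it struggles at bulk-load volumes; batching with `table.put(List<Put>)` helps, but doesn't change the fundamentals.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DirectPuts {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("my_table"))) { // hypothetical table
      byte[] protoBytes = Bytes.toBytes("...");   // stand-in for message.toByteArray()
      Put put = new Put(Bytes.toBytes("row-1"));  // hypothetical row key
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("proto"), protoBytes);
      table.put(put);  // every Put goes through the full WAL + memstore write path
    }
  }
}
```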

But now I want to achieve this in an even more efficient way:

Export the data from my Hive table in S3, serialize it into protocol buffer objects, then generate HFiles and load them directly into HBase.

I'm using a Spark job to read from Hive, which gives me a JavaRDD; from there I can build my protocol buffer objects, but I'm at a loss as to how to proceed.

So my question: how can I generate HFiles from protocol buffer objects? We don't want to save them as text files on local disk or HDFS first, so how can I generate the HFiles directly?

Thanks a lot!


Solution

  • Thanks to @Samson for pointing to that awesome post.

    After trial and error, I got things working. To save others the pain, here's a working example.

    What it does: it uses Spark to read the data from S3, repartition it to match the table's regions, and generate HFiles.
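
Below is a minimal sketch of that flow in Java against the HBase 1.x APIs (HFileOutputFormat2, LoadIncrementalHFiles). The table name `my_table`, column family `cf`, qualifier `proto`, the S3 and staging paths, and the `buildProtoBytes` helper are all hypothetical stand-ins for your own setup.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;
import org.apache.spark.Partitioner;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class ProtobufHFileLoad {

  // Route each row key to the index of the region whose key range contains
  // it, so each Spark partition lines up with exactly one HBase region.
  static class RegionPartitioner extends Partitioner {
    private final byte[][] startKeys;  // region start keys, ascending; [0] is empty

    RegionPartitioner(byte[][] startKeys) { this.startKeys = startKeys; }

    @Override public int numPartitions() { return startKeys.length; }

    @Override public int getPartition(Object key) {
      byte[] row = Bytes.toBytes((String) key);
      for (int i = startKeys.length - 1; i > 0; i--) {
        if (Bytes.compareTo(row, startKeys[i]) >= 0) return i;
      }
      return 0;  // everything else belongs to the first region
    }
  }

  public static void main(String[] args) throws Exception {
    JavaSparkContext sc =
        new JavaSparkContext(new SparkConf().setAppName("protobuf-hfile-load"));
    Configuration conf = HBaseConfiguration.create();
    TableName tableName = TableName.valueOf("my_table");  // hypothetical table
    byte[] cf = Bytes.toBytes("cf");                      // hypothetical column family
    byte[] qual = Bytes.toBytes("proto");                 // hypothetical qualifier

    try (Connection conn = ConnectionFactory.createConnection(conf)) {
      Table table = conn.getTable(tableName);
      RegionLocator locator = conn.getRegionLocator(tableName);

      // 1. Read the tab-separated Hive export from S3 and serialize each row
      //    as protobuf bytes, keyed by a String row key.
      JavaPairRDD<String, byte[]> rows = sc
          .textFile("s3://bucket/hive-export/")           // hypothetical path
          .mapToPair(line -> {
            String[] fields = line.split("\t");
            return new Tuple2<>(fields[0], buildProtoBytes(fields));
          });

      // 2. Repartition so each partition covers one region and sort each
      //    partition by key -- HFiles must be written in row-key order.
      JavaPairRDD<ImmutableBytesWritable, KeyValue> hfileRdd = rows
          .repartitionAndSortWithinPartitions(
              new RegionPartitioner(locator.getStartKeys()))
          .mapToPair(t -> {
            byte[] row = Bytes.toBytes(t._1);
            return new Tuple2<>(new ImmutableBytesWritable(row),
                                new KeyValue(row, cf, qual, t._2));
          });

      // 3. Copy the table's compression/bloom/block-size settings into the
      //    job config, then write the HFiles to a staging directory.
      Job job = Job.getInstance(conf);
      HFileOutputFormat2.configureIncrementalLoad(job, table, locator);
      hfileRdd.saveAsNewAPIHadoopFile(
          "hdfs:///tmp/hfiles",                           // hypothetical staging dir
          ImmutableBytesWritable.class, KeyValue.class,
          HFileOutputFormat2.class, job.getConfiguration());

      // 4. Hand the finished HFiles to HBase -- a metadata move, not a copy.
      new LoadIncrementalHFiles(conf)
          .doBulkLoad(new Path("hdfs:///tmp/hfiles"), conn.getAdmin(), table, locator);
    }
  }

  // Hypothetical helper: build your generated protobuf message from the
  // parsed fields and return message.toByteArray(). Placeholder body here.
  private static byte[] buildProtoBytes(String[] fields) {
    return Bytes.toBytes(String.join("|", fields));
  }
}
```

Note that the conversion to ImmutableBytesWritable/KeyValue happens after the shuffle: the repartition-and-sort runs while the keys and values are still plain Strings and byte arrays, since the HBase Writable types don't serialize cleanly through a Spark shuffle without extra Kryo setup. The sort also assumes ASCII row keys, where String ordering matches HBase's unsigned-byte ordering; for arbitrary binary keys you'd need a byte-wise comparator.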