Has anybody succeeded in loading data from Bigtable via Pig on Dataproc using HBaseStorage?
Here's a very simple Pig script I'm trying to run. It fails with an error indicating it can't find the BigtableConnection class, and I'm wondering what setup I may be missing to successfully load data from Bigtable.
raw = LOAD 'hbase://my_hbase_table'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'cf:*', '-minTimestamp 1490104800000 -maxTimestamp 1490105100000 -loadKey true -limit 5')
    AS (key:chararray, data);
DUMP raw;
Steps I followed to set up my cluster:
1. Added hbase-site.xml properties for my_bt and the BigtableConnection class (sketched just after the error below)
2. Created t.pig with the contents listed above
3. Submitted the job: gcloud beta dataproc jobs submit pig --cluster my_dp --file t.pig --jars /opt/hbase-1.2.1/lib/bigtable/bigtable-hbase-1.2-0.9.5.1.jar
The job fails with:
2017-03-21 15:30:48,029 [JobControl] ERROR org.apache.hadoop.hbase.mapreduce.TableInputFormat - java.io.IOException: java.lang.ClassNotFoundException: com.google.cloud.bigtable.hbase1_2.BigtableConnection
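For reference, the hbase-site.xml additions from step 1 look roughly like this (a sketch; MY_PROJECT and MY_INSTANCE are placeholders, and the property names are the same ones used in the working job configuration further down):
<!-- sketch of hbase-site.xml entries; MY_PROJECT and MY_INSTANCE are placeholders -->
<property>
  <name>hbase.client.connection.impl</name>
  <value>com.google.cloud.bigtable.hbase1_2.BigtableConnection</value>
</property>
<property>
  <name>google.bigtable.project.id</name>
  <value>MY_PROJECT</value>
</property>
<property>
  <name>google.bigtable.instance.id</name>
  <value>MY_INSTANCE</value>
</property>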
The trick is getting all the dependencies onto Pig's classpath. Using the jars pointed to by Solomon, I created the following initialization action, which downloads two jars (the shaded Bigtable MapReduce jar and netty-tcnative-boringssl) and sets up the Pig classpath.
#!/bin/bash
# Initialization action to set up pig for use with cloud bigtable
mkdir -p /opt/pig/lib/

# Download the netty-tcnative and shaded Bigtable MapReduce jars Pig needs on its classpath
curl http://repo1.maven.org/maven2/io/netty/netty-tcnative-boringssl-static/1.1.33.Fork19/netty-tcnative-boringssl-static-1.1.33.Fork19.jar \
  -f -o /opt/pig/lib/netty-tcnative-boringssl-static-1.1.33.Fork19.jar
curl http://repo1.maven.org/maven2/com/google/cloud/bigtable/bigtable-hbase-mapreduce/0.9.5.1/bigtable-hbase-mapreduce-0.9.5.1-shaded.jar \
  -f -o /opt/pig/lib/bigtable-hbase-mapreduce-0.9.5.1-shaded.jar

# Append every downloaded jar to PIG_CLASSPATH by extending pig-env.sh
cat >>/etc/pig/conf/pig-env.sh <<EOF
#!/bin/bash
for f in /opt/pig/lib/*.jar; do
  if [ -z "\${PIG_CLASSPATH}" ]; then
    export PIG_CLASSPATH="\${f}"
  else
    export PIG_CLASSPATH="\${PIG_CLASSPATH}:\${f}"
  fi
done
EOF
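Assuming the script is uploaded to a GCS bucket (the bucket and file name below are placeholders), it can be attached when creating the cluster:
# sketch: my-bucket and pig-bigtable-init.sh are placeholders
gsutil cp pig-bigtable-init.sh gs://my-bucket/pig-bigtable-init.sh
gcloud dataproc clusters create my_dp \
  --initialization-actions gs://my-bucket/pig-bigtable-init.sh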
You can then pass in the Bigtable configuration in the usual ways, for example by specifying properties when submitting a job:
PROPERTIES='hbase.client.connection.impl='
PROPERTIES+='com.google.cloud.bigtable.hbase1_2.BigtableConnection'
PROPERTIES+=',google.bigtable.instance.id=MY_INSTANCE'
PROPERTIES+=',google.bigtable.project.id=MY_PROJECT'
gcloud dataproc jobs submit pig --cluster MY_DATAPROC_CLUSTER \
--properties="${PROPERTIES}" \
-e "f = LOAD 'hbase://MY_TABLE'
USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:*','-loadKey true')
AS (key:chararray, data);
DUMP f;"