google-api google-api-java-client google-compute-engine snappy google-hadoop

How to enable Snappy/Snappy Codec over hadoop cluster for Google Compute Engine

I am trying to run Hadoop Job on Google Compute engine against our compressed data, which is sitting on Google Cloud Storage. While trying to read the data through SequenceFileInputFormat, I get the following exception:

hadoop@hadoop-m:/home/salikeeno$ hadoop jar ${JAR} ${PROJECT} ${OUTPUT_TABLE}
14/08/21 19:56:00 INFO jaws.JawsApp: Using export bucket 'askbuckerthroughhadoop' as specified in 'mapred.bq.gcs.bucket'
14/08/21 19:56:00 INFO bigquery.BigQueryConfiguration: Using specified project-id 'regal-campaign-641' for output
14/08/21 19:56:00 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.2.8-hadoop1
14/08/21 19:56:01 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/08/21 19:56:03 INFO input.FileInputFormat: Total input paths to process : 1
14/08/21 19:56:09 INFO mapred.JobClient: Running job: job_201408211943_0002
14/08/21 19:56:10 INFO mapred.JobClient:  map 0% reduce 0%
14/08/21 19:56:20 INFO mapred.JobClient: Task Id : attempt_201408211943_0002_m_000001_0, Status : FAILED
java.lang.RuntimeException: native snappy library not available
        at org.apache.hadoop.io.compress.SnappyCodec.getDecompressorType(SnappyCodec.java:189)
        at org.apache.hadoop.io.compress.CodecPool.getDecompressor(CodecPool.java:125)
        at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1581)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1490)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1479)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1474)
        at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.initialize(SequenceFileRecordReader.java:50)
        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:521)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
        at org.apache.hadoop.mapred.Child.main(Child.java:249)

It seems that the SnappyCodec is not available. How should I include/enable Snappy in my Hadoop cluster on google compute engine?
Can I deploy Snappy lib (if I have to) through bdutil script while deploying a Hadoop cluster?
What is the best approach to deploy third party libs/jars on Hadoop cluster deployed on Google Compute engine?

Thanks a lot

Solution

This procedure is no longer required.

A bdutil deployment will contain Snappy by default.

For reference, the original answer:

Your last question is the easiest to answer in the general case so I'll begin there. The general guidance for shipping dependencies is that applications should make use of the distributed cache to distribute JARs and libraries to workers (Hadoop 1 or 2). If your code is already making use of the GenericOptionsParser you can distrubte JARs with the -libjars flag. A longer discussion can be found on Cloudera's blog that also discusses fat JARs: http://blog.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/

For installing and configuring other system-level components bdutil supports an extension mechanism. A good example of extensions is the Spark extension bundled with bdutil: extensions/spark/spark_env.sh. When running bdutil extensions are added with the -e flag e.g., to deploy Spark with Hadoop:

./bdutil -e extensions/spark/spark_env.sh deploy

With regard to your first and second questions: there are two obstacles when dealing with Snappy in Hadoop on GCE. The first is that the native support libraries built by Apache and bundled with Hadoop 2 tarballs are built for i386 while GCE instances are amd64. Hadoop 1 bundles binaries for both platforms, but snappy is not locatable without either bundling or modifying the environment. Because of this architecture difference, no native compressors are usable (snappy or otherwise) in Hadoop 2 and Snappy is not available easily in Hadoop 1. The second obstacle is that libsnappy itself is not installed by default.

The easiest way to overcome both of these is to create your own Hadoop tarball containing amd64 native Hadoop libraries as well as libsnappy. The steps below should help you do this and stage the resulting tarball for use by bdutil.

To start, launch a new GCE VM using a Debian Wheezy backports image and grant the VM service account read/write access to Cloud Storage. We'll use this as our build machine and we can safely discard it as soon as we're done building / storing the binary.

Building Hadoop 1.2.1 with Snappy

SSH to your new instance and run the following commands, checking for any errors along the way:

sudo apt-get update
sudo apt-get install pkg-config libsnappy-dev libz-dev libssl-dev gcc make cmake automake autoconf libtool g++ openjdk-7-jdk maven ant

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/

wget http://apache.mirrors.lucidnetworks.net/hadoop/common/hadoop-1.2.1/hadoop-1.2.1.tar.gz

tar zxvf hadoop-1.2.1.tar.gz 
pushd hadoop-1.2.1/

# Bundle libsnappy so we don't have to apt-get install it on each machine
cp /usr/lib/libsnappy* lib/native/Linux-amd64-64/

# Test to make certain Snappy is being loaded and is working:
bin/hadoop jar ./hadoop-test-1.2.1.jar testsequencefile -seed 0 -count 1000 -compressType RECORD xxx -codec org.apache.hadoop.io.compress.SnappyCodec -check

# Create a new tarball of Hadoop 1.2.1:
popd
rm hadoop-1.2.1.tar.gz
tar zcvf hadoop-1.2.1.tar.gz hadoop-1.2.1/

# Store the tarball on GCS: 
gsutil cp hadoop-1.2.1.tar.gz gs://<some bucket>/hadoop-1.2.1.tar.gz

Building Hadoop 2.4.1 with Snappy

SSH to your new instance and run the following commands, checking for any errors along the way:

sudo apt-get update
sudo apt-get install pkg-config libsnappy-dev libz-dev libssl-dev gcc make cmake automake autoconf libtool g++ openjdk-7-jdk maven ant

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/

# Protobuf 2.5.0 is required and not in Debian-backports
wget http://protobuf.googlecode.com/files/protobuf-2.5.0.tar.gz
tar xvf protobuf-2.5.0.tar.gz
pushd protobuf-2.5.0/ && ./configure && make && sudo make install && popd
sudo ldconfig

wget http://apache.mirrors.lucidnetworks.net/hadoop/common/hadoop-2.4.1/hadoop-2.4.1-src.tar.gz

# Unpack source
tar zxvf hadoop-2.4.1-src.tar.gz
pushd hadoop-2.4.1-src

# Build Hadoop
mvn package -Pdist,native -DskipTests -Dtar
pushd hadoop-dist/target/
pushd hadoop-2.4.1/

# Bundle libsnappy so we don't have to apt-get install it on each machine
cp /usr/lib/libsnappy* lib/native/

# Test that everything is working:
bin/hadoop jar share/hadoop/common/hadoop-common-2.4.1-tests.jar org.apache.hadoop.io.TestSequenceFile -seed 0 -count 1000 -compressType RECORD xxx -codec org.apache.hadoop.io.compress.SnappyCodec -check

popd

# Create a new tarball with libsnappy:
rm hadoop-2.4.1.tar.gz
tar zcf hadoop-2.4.1.tar.gz hadoop-2.4.1/

# Store the new tarball on GCS:
gsutil cp hadoop-2.4.1.tar.gz gs://<some bucket>/hadoop-2.4.1.tar.gz

popd
popd

Updating bdutil_env.sh or hadoop2_env.sh

Once you have a Hadoop version with the correct native libraries bundled, we can point bdutil at the new Hadoop tarball by updating either bdutil_env.sh for Hadoop 1 or hadoop2_env.sh for Hadoop 2. In either case, open the approprirate file and look for a block along the lines of:

# URI of Hadoop tarball to be deployed. Must begin with gs:// or http(s)://
# Use 'gsutil ls gs://hadoop-dist/hadoop-*.tar.gz' to list Google supplied options
HADOOP_TARBALL_URI='gs://hadoop-dist/hadoop-1.2.1-bin.tar.gz'

and change the URI pointed to to be the URI where we stored the tarball above: e.g.,

HADOOP_TARBALL_URI='gs://<some bucket>/hadoop-1.2.1.tar.gz'