I would like to write data from flume-ng to Google Cloud Storage. It is a little bit complicated, because I observed a very strange behavior. Let me explain:
I've launched a hadoop cluster on google cloud (one click) set up to use a bucket.
When I ssh on the master and add a file with hdfs
command, I can see it immediately in my bucket
$ hadoop fs -ls /
14/11/27 15:01:41 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.2.9-hadoop2
Found 1 items
-rwx------ 3 hadoop hadoop 40 2014-11-27 13:45 /test.txt
But when I try to add then read from my computer, it seems to use some other HDFS. Here I added a file called jp.txt
, and it doesn't show my previous file test.txt
$ hadoop fs -ls hdfs://ip.to.my.cluster/
Found 1 items
-rw-r--r-- 3 jp supergroup 0 2014-11-27 14:57 hdfs://ip.to.my.cluster/jp.txt
That's also the only file I see when I explore HDFS on http://ip.to.my.cluster:50070/explorer.html#/
When I list files in my bucket with the web console (https://console.developers.google.com/project/my-project-id/storage/my-bucket/), I can only see test.txt
and not jp.txt
.
I read Hadoop cannot connect to Google Cloud Storage and I configured my hadoop client accordingly (pretty hard stuff) and now I can see items in my bucket. But for that, I need to use a gs://
URI
$ hadoop fs -ls gs://my-bucket/
14/11/27 15:57:46 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.3.0-hadoop2
Found 1 items
-rwx------ 3 jp jp 40 2014-11-27 14:45 gs://my-bucket/test.txt
So it seems here there are 2 different storages engine in the same cluster: "traditional HDFS" (starting with hdfs://
) and a Google storage bucket (starting with gs://
).
Users and rights are different, depending on where you are listing files from.
The main question is: What is the minimal setup needed to write to HDFS/GS on Google Cloud Storage with flume ?
a1.sources = http
a1.sinks = hdfs_sink
a1.channels = mem
# Describe/configure the source
a1.sources.http.type = org.apache.flume.source.http.HTTPSource
a1.sources.http.port = 9000
# Describe the sink
a1.sinks.hdfs_sink.type = hdfs
a1.sinks.hdfs_sink.hdfs.path = hdfs://ip.to.my.cluster:8020/%{env}/%{tenant}/%{type}/%y-%m-%d
a1.sinks.hdfs_sink.hdfs.filePrefix = %H-%M-%S_
a1.sinks.hdfs_sink.hdfs.fileSuffix = .json
a1.sinks.hdfs_sink.hdfs.round = true
a1.sinks.hdfs_sink.hdfs.roundValue = 10
a1.sinks.hdfs_sink.hdfs.roundUnit = minute
# Use a channel which buffers events in memory
a1.channels.mem.type = memory
a1.channels.mem.capacity = 1000
a1.channels.mem.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.http.channels = mem
a1.sinks.hdfs_sink.channel = mem
Does the line a1.sinks.hdfs_sink.hdfs.path accept a gs://
path ?
What setup would it need in that case (additional jars, classpath) ?
Thanks
As you observed, it's actually fairly common to be able to access different storage systems from the same Hadoop cluster, based on the scheme://
of the URI you use with hadoop fs
. The cluster you deployed on Google Compute Engine also has both filesystems available, it just happens to have the "default" set to gs://your-configbucket
.
The reason you had to include the gs://configbucket/file
instead of just plain /file
on your local cluster is that in your one-click deployment, we additionally included a key in your Hadoop's core-site.xml
, setting fs.default.name
to be gs://configbucket/
. You can achieve the same effect on your local cluster to make it use GCS for all the schemeless paths; in your one-click cluster, check out /home/hadoop/hadoop-install/core-site.xml
for a reference of what you might carry over to your local setup.
To explain the internals of Hadoop a bit, the reason hdfs://
paths work normally is actually because there is a configuration key which in theory can be overridden in Hadoop's core-site.xml
file, which by default sets:
<property>
<name>fs.hdfs.impl</name>
<value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
<description>The FileSystem for hdfs: uris.</description>
</property>
Similarly, you may have noticed that to get gs://
to work on your local cluster, you provided fs.gs.impl
. This is because DistribtedFileSystem and GoogleHadoopFileSystem both implement the same Hadoop Java interface FileSystem
, and Hadoop is built to be agnostic to how an implementation chooses to actually implement the FileSystem methods. This also means that at the most basic level, anywhere you could normally use hdfs://
you should be able to use gs://
.
So, to answer your questions:
hdfs://
paths with gs://
paths instead, and/or setting fs.default.name
to be your root gs://configbucket
path.appends
to existing files or symlinks.