Tags: hadoop, hdfs, big-data, flume, spool

Can Spool Dir of flume be in remote machine?


I am trying to fetch files from a remote machine into HDFS whenever a new file arrives in a particular folder. I came across Flume's spooling directory source, and it works fine as long as the spool dir is on the same machine where the Flume agent is running.

Is there any way to configure a spool dir on a remote machine? Please help.


Solution

  • You may be aware that Flume can run as multiple agents, i.e. you can install several Flume instances that pass data between them.

    So to answer your question: no, Flume cannot read from a remote spool directory. But you can install two agents, one on the machine with the spool directory and one on the Hadoop node.

    The first reads from the spool directory and forwards the events via Avro RPC to the second agent, which flushes the data to HDFS.

    It's a simple setup that requires just a couple of lines of configuration per agent; a sketch is shown below.
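
    Here is a minimal sketch of the two configuration files. The hostnames, port, and paths (hadoop-node.example.com, port 4545, /var/spool/incoming, and the HDFS path) are illustrative placeholders, not values from the question; substitute your own.

      # agent1.conf -- runs on the remote machine that owns the spool directory

      agent1.sources = spool-src
      agent1.channels = mem-ch
      agent1.sinks = avro-sink

      # Spooling directory source: watches the folder for new files
      # (spoolDir below is a placeholder path)
      agent1.sources.spool-src.type = spooldir
      agent1.sources.spool-src.spoolDir = /var/spool/incoming
      agent1.sources.spool-src.channels = mem-ch

      agent1.channels.mem-ch.type = memory

      # Avro sink: forwards events to the agent on the Hadoop node
      # (hostname and port below are placeholders)
      agent1.sinks.avro-sink.type = avro
      agent1.sinks.avro-sink.hostname = hadoop-node.example.com
      agent1.sinks.avro-sink.port = 4545
      agent1.sinks.avro-sink.channel = mem-ch

      # agent2.conf -- runs on the Hadoop node

      agent2.sources = avro-src
      agent2.channels = mem-ch
      agent2.sinks = hdfs-sink

      # Avro source: receives the events sent by agent1
      agent2.sources.avro-src.type = avro
      agent2.sources.avro-src.bind = 0.0.0.0
      agent2.sources.avro-src.port = 4545
      agent2.sources.avro-src.channels = mem-ch

      agent2.channels.mem-ch.type = memory

      # HDFS sink: writes the events into HDFS
      # (hdfs.path below is a placeholder)
      agent2.sinks.hdfs-sink.type = hdfs
      agent2.sinks.hdfs-sink.hdfs.path = hdfs://namenode:8020/flume/incoming
      agent2.sinks.hdfs-sink.hdfs.fileType = DataStream
      agent2.sinks.hdfs-sink.channel = mem-ch

    You would then start each agent with flume-ng, e.g. flume-ng agent --conf-file agent1.conf --name agent1 on the remote machine, and likewise with agent2.conf and --name agent2 on the Hadoop node.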