
Use Flume to stream data to S3


I am trying Flume for something very simple: pushing the content of my log files to S3. I was able to create a Flume agent that reads an Apache access log file and writes to a logger sink. Now I am looking for a way to replace the logger sink with an "S3 sink". (I know this does not exist by default.)

I am looking for some pointers in the right direction. Below is the test properties file I am currently using.

a1.sources=src1
a1.sinks=sink1
a1.channels=ch1

#source configuration
a1.sources.src1.type=exec
a1.sources.src1.command=tail -f /var/log/apache2/access.log

#sink configuration
a1.sinks.sink1.type=logger

#channel configuration
a1.channels.ch1.type=memory
a1.channels.ch1.capacity=1000
a1.channels.ch1.transactionCapacity=100

#links
a1.sources.src1.channels=ch1
a1.sinks.sink1.channel=ch1

Solution

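  • You can use the HDFS sink. S3 is not HDFS, but Hadoop's file system layer can write to it through the s3n:// scheme, so you only need to replace the HDFS path with your bucket in this way. Don't forget to substitute your own values for <AWS.ACCESS.KEY>, <AWS.SECRET.KEY> and <bucket.name>.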

    agent.sinks.s3hdfs.type = hdfs
    agent.sinks.s3hdfs.hdfs.path = s3n://<AWS.ACCESS.KEY>:<AWS.SECRET.KEY>@<bucket.name>/prefix/
    agent.sinks.s3hdfs.hdfs.fileType = DataStream
    agent.sinks.s3hdfs.hdfs.filePrefix = FilePrefix
    agent.sinks.s3hdfs.hdfs.writeFormat = Text
    agent.sinks.s3hdfs.hdfs.rollCount = 0
    # roll files at 64 MB (67108864 bytes); comments must be on their own line in a properties file
    agent.sinks.s3hdfs.hdfs.rollSize = 67108864
    agent.sinks.s3hdfs.hdfs.batchSize = 10000
    agent.sinks.s3hdfs.hdfs.rollInterval = 0
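
Wired into the agent from the question, the sink might look roughly like the sketch below. It reuses the agent name a1, the source src1 and the channel ch1 from the question's file. The sink name s3sink, the file prefix and the path prefix are illustrative, and <bucket.name> remains a placeholder. The sketch assumes the Hadoop S3 connector jars are on Flume's classpath and that credentials come from Hadoop configuration (fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey in core-site.xml) instead of being embedded in the URL. batchSize is lowered to 100 here so that it does not exceed the channel's transactionCapacity.

```properties
a1.sources = src1
a1.sinks = s3sink
a1.channels = ch1

# source: follow the Apache access log, as in the question
a1.sources.src1.type = exec
a1.sources.src1.command = tail -f /var/log/apache2/access.log
a1.sources.src1.channels = ch1

# channel: in-memory buffer, unchanged from the question
a1.channels.ch1.type = memory
a1.channels.ch1.capacity = 1000
a1.channels.ch1.transactionCapacity = 100

# sink: HDFS sink pointed at an S3 bucket (bucket name is a placeholder)
a1.sinks.s3sink.type = hdfs
a1.sinks.s3sink.channel = ch1
a1.sinks.s3sink.hdfs.path = s3n://<bucket.name>/prefix/
a1.sinks.s3sink.hdfs.fileType = DataStream
a1.sinks.s3sink.hdfs.writeFormat = Text
a1.sinks.s3sink.hdfs.filePrefix = access
# roll on size only: 64 MB (67108864 bytes), no count- or time-based rolling
a1.sinks.s3sink.hdfs.rollSize = 67108864
a1.sinks.s3sink.hdfs.rollCount = 0
a1.sinks.s3sink.hdfs.rollInterval = 0
# events taken per transaction; kept at or below ch1.transactionCapacity
a1.sinks.s3sink.hdfs.batchSize = 100
```

On Hadoop versions where the s3n connector has been removed, the s3a:// scheme would take its place, with credentials set through fs.s3a.access.key and fs.s3a.secret.key.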