Flume has some ready-made components for transforming events before pushing them further, like RegexHbaseEventSerializer, which you can plug into an HBaseSink. It's also easy to provide a custom serializer.
I want to process events and send them on to the next channel. The closest thing to what I want is the Regex Extractor Interceptor, which accepts a custom serializer for regex matches. But it does not replace the event body; it only appends new headers with the match results, which makes the output flow heavier. I'd like to accept large events, such as zipped HTML pages over 5 KB, parse them, and put many slim messages (e.g., URLs found in the pages) onto another channel:
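To illustrate the parsing step itself (independent of Flume), extracting URLs from a page body with a regex could look like the sketch below; the class name and the simplistic `href` pattern are hypothetical, and a real parser would need to handle unzipping and messier markup:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical core of a "one big page in, many slim URL events out" parser.
public class PageParserDemo {
    // Naive href-matching pattern; shown only to sketch the idea.
    private static final Pattern HREF =
            Pattern.compile("href=[\"']([^\"']+)[\"']");

    public static List<String> extractUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            urls.add(m.group(1)); // each match becomes one slim message
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://a.example/1\">one</a>"
                    + "<a href='http://a.example/2'>two</a>";
        System.out.println(extractUrls(html));
    }
}
```

Each extracted URL would then be wrapped in its own Flume Event and written to the downstream channel.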
                 channel1                channel2
HtmlPagesSource ----------> PageParser ----------> WhateverSinkGoesNext
     (html)                    (urls)
Do I have to write a custom sink for that, or is there some type of component that accepts custom serializers, like HBaseSink does?
If I write a sink, do I just use the Flume client SDK and call append(Event) or appendBatch(List) when processing incoming events?
It seems like you need to run two Flume agents:
Agent1: HtmlPagesSource -> channel -> PageParser (extends AvroSink and overrides the process method so that it parses the input and writes out many slim messages)
Agent2: AvroSource -> channel -> WhateverSinkGoesNext
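A minimal sketch of what the two agent configurations might look like, assuming a memory channel and Avro between the agents on localhost:4545; the custom class names, ports, and the logger sink stand-in are all illustrative:

```properties
# agent1.conf -- parses pages and forwards slim events over Avro
agent1.sources = pages
agent1.channels = ch1
agent1.sinks = parser

agent1.sources.pages.type = org.example.HtmlPagesSource  # hypothetical custom source
agent1.sources.pages.channels = ch1

agent1.channels.ch1.type = memory

agent1.sinks.parser.type = org.example.PageParserSink    # hypothetical AvroSink subclass
agent1.sinks.parser.channel = ch1
agent1.sinks.parser.hostname = localhost
agent1.sinks.parser.port = 4545

# agent2.conf -- receives slim events and passes them on
agent2.sources = avro
agent2.channels = ch2
agent2.sinks = next

agent2.sources.avro.type = avro
agent2.sources.avro.bind = localhost
agent2.sources.avro.port = 4545
agent2.sources.avro.channels = ch2

agent2.channels.ch2.type = memory

agent2.sinks.next.type = logger                          # stand-in for WhateverSinkGoesNext
agent2.sinks.next.channel = ch2
```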
See this example of chaining Flume data flows: http://www.ibm.com/developerworks/library/bd-flumews/#N10081