Flume has some ready-made components for transforming events before pushing them further, like RegexHbaseEventSerializer, which you can plug into an HBaseSink. It's also easy to provide a custom serializer.
I want to process events and send them on to the next channel. The closest thing to what I want is the Regex Extractor Interceptor, which accepts a custom serializer for regex matches. But it does not replace the event body; it only appends new headers with the match results, which makes the output flow heavier. I'd like to accept large events, such as zipped HTML pages over 5 KB, parse them, and put many slim messages (e.g., URLs found in the pages) onto another channel:
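To illustrate the parsing step itself (independent of Flume), extracting URLs from a page body with a regex could look like the sketch below; the class name and the simplistic `href` pattern are hypothetical, and a real parser would need to handle unzipping and messier markup:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical core of a "one big page in, many slim URL events out" parser.
public class PageParserDemo {
    // Naive href-matching pattern; shown only to sketch the idea.
    private static final Pattern HREF =
            Pattern.compile("href=[\"']([^\"']+)[\"']");

    public static List<String> extractUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            urls.add(m.group(1)); // each match becomes one slim message
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://a.example/1\">one</a>"
                    + "<a href='http://a.example/2'>two</a>";
        System.out.println(extractUrls(html));
    }
}
```

Each extracted URL would then be wrapped in its own Flume Event and written to the downstream channel.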
                 channel1                channel2
HtmlPagesSource ----------> PageParser ----------> WhateverSinkGoesNext
     (html)                    (urls)
Do I have to write a custom sink for that, or is there some type of component that accepts custom serializers, like HBaseSink does?
If I write a sink, do I just use the Flume client SDK and call append(Event) or appendBatch(List) when processing incoming events?
It seems like you need to run two Flume agents:
Agent1: HtmlPagesSource -> channel -> PageParser (extends AvroSink and overrides the process method so that it parses the input and writes out many slim messages)
Agent2: AvroSource -> channel -> WhateverSinkGoesNext
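A minimal sketch of what the two agent configurations might look like, assuming a memory channel and Avro between the agents on localhost:4545; the custom class names, ports, and the logger sink stand-in are all illustrative:

```properties
# agent1.conf -- parses pages and forwards slim events over Avro
agent1.sources = pages
agent1.channels = ch1
agent1.sinks = parser

agent1.sources.pages.type = org.example.HtmlPagesSource  # hypothetical custom source
agent1.sources.pages.channels = ch1

agent1.channels.ch1.type = memory

agent1.sinks.parser.type = org.example.PageParserSink    # hypothetical AvroSink subclass
agent1.sinks.parser.channel = ch1
agent1.sinks.parser.hostname = localhost
agent1.sinks.parser.port = 4545

# agent2.conf -- receives slim events and passes them on
agent2.sources = avro
agent2.channels = ch2
agent2.sinks = next

agent2.sources.avro.type = avro
agent2.sources.avro.bind = localhost
agent2.sources.avro.port = 4545
agent2.sources.avro.channels = ch2

agent2.channels.ch2.type = memory

agent2.sinks.next.type = logger                          # stand-in for WhateverSinkGoesNext
agent2.sinks.next.channel = ch2
```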
See this example of chaining Flume data flows: http://www.ibm.com/developerworks/library/bd-flumews/#N10081