Tags: hadoop, apache-kafka, flume-ng

Big Data ingestion - Flafka use cases


I have seen that the Big Data community is very enthusiastic about using Flafka in many ways for data ingestion, but I haven't really understood why yet.

A simple example I have developed to better understand this ingests Twitter data and moves it to multiple sinks (HDFS, Storm, HBase).

I have implemented the ingestion part in the following two ways: (1) a plain Kafka Java producer with multiple consumers; (2) Flume agent #1 (Twitter source + Kafka sink) feeding a (potential) Flume agent #2 (Kafka source + multiple sinks). I haven't seen any real difference in the complexity of developing these solutions (this isn't a production system, so I can't comment on performance). The only thing I found online is that a good use case for Flafka is aggregating data from multiple sources in one place before it is consumed in different places.
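For context, option (1) boils down to a plain producer along the lines of the following minimal sketch. The topic name `tweets`, the broker address, and the hard-coded payload are placeholder assumptions; a real implementation would stream records from the Twitter API, and each downstream system (HDFS, Storm, HBase) would read the topic with its own consumer group:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TweetProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Each raw tweet (e.g. its JSON payload) is published to a single topic;
            // HDFS, Storm, and HBase consumers read it independently via separate consumer groups.
            String tweetJson = "{\"text\": \"example tweet\"}"; // stand-in for a real Twitter API record
            producer.send(new ProducerRecord<>("tweets", tweetJson));
        }
    }
}
```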

Can someone explain why I would use Flume+Kafka over plain Kafka or plain Flume?


Solution

  • People usually combine Flume and Kafka because Flume has a great (and battle-tested) set of connectors (HDFS, Twitter, HBase, etc.) and Kafka brings resilience. Also, Kafka helps distribute Flume events between nodes.

    EDIT:

    Kafka replicates the log for each topic's partitions across a configurable number of servers (you can set this replication factor on a topic-by-topic basis). This allows automatic failover to these replicas when a server in the cluster fails so messages remain available in the presence of failures. -- https://kafka.apache.org/documentation#replication

    Thus, as soon as Flume gets the message into Kafka, you have a guarantee that your data won't be lost. NB: you can integrate Kafka with Flume at every stage of your ingestion pipeline (i.e., Kafka can be used as a source, a channel, and a sink, too); a minimal configuration is sketched below.
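
    To make this concrete, here is a hypothetical Flume agent configuration using Kafka as the channel between a Twitter source and an HDFS sink. The agent name, topic, broker address, HDFS path, and credential values are all placeholders, not values from the question:

    ```
    # Hypothetical agent 'a1': Twitter source -> Kafka channel -> HDFS sink
    a1.sources = twitter
    a1.channels = kafka-ch
    a1.sinks = hdfs-sink

    a1.sources.twitter.type = org.apache.flume.source.twitter.TwitterSource
    a1.sources.twitter.channels = kafka-ch
    a1.sources.twitter.consumerKey = YOUR_CONSUMER_KEY
    a1.sources.twitter.consumerSecret = YOUR_CONSUMER_SECRET
    a1.sources.twitter.accessToken = YOUR_ACCESS_TOKEN
    a1.sources.twitter.accessTokenSecret = YOUR_ACCESS_TOKEN_SECRET

    # Kafka acts as the channel, so buffered events are replicated
    # by the Kafka cluster and survive an agent restart
    a1.channels.kafka-ch.type = org.apache.flume.channel.kafka.KafkaChannel
    a1.channels.kafka-ch.kafka.bootstrap.servers = localhost:9092
    a1.channels.kafka-ch.kafka.topic = flume-channel

    a1.sinks.hdfs-sink.type = hdfs
    a1.sinks.hdfs-sink.channel = kafka-ch
    a1.sinks.hdfs-sink.hdfs.path = hdfs://namenode:8020/user/flume/tweets
    ```

    With this layout, the durability guarantee quoted above applies to the channel itself: events sit in a replicated Kafka topic rather than in a single agent's memory or local disk.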