Tags: apache-kafka, flume, avro, flume-ng, kafka-consumer-api

Kafka source vs. Avro source for reading and writing data into a Kafka channel using Flume


In Flume, I have a Kafka channel from which I can read and write data. What is the difference in read and write performance on the Kafka channel if I replace the Kafka source and Kafka sink with an Avro source and an Avro sink?

My understanding is that by replacing the Kafka source with an Avro source, I will be unable to read data in parallel from multiple partitions of the Kafka broker, since no consumer group is specified for an Avro source. Please correct me if I am wrong.
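For reference, the two setups being compared can be sketched as follows, with hypothetical agent and component names (a1, r1, s1, c1) and Flume 1.6-era type names:

    # Setup A: Kafka source and Kafka sink around a Kafka channel
    a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
    a1.sinks.s1.type = org.apache.flume.sink.kafka.KafkaSink
    a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel

    # Setup B: Avro source and Avro sink around the same Kafka channel
    a1.sources.r1.type = avro
    a1.sinks.s1.type = avro
    a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel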


Solution

  • In Flume, the Avro RPC source binds to a specific TCP port on a network interface, so only one Avro source, on one of the Flume agents running on a given machine, can ever receive events sent to that port.

    The Avro source is meant to connect two or more Flume agents together: one or more Avro sinks connect to a single Avro source, as sketched below.
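    A minimal sketch of this pairing, assuming Flume 1.6-era property names and hypothetical agent names (remote, central), host, and port:

        # On the remote agent: an Avro sink pointing at the central agent
        remote.sinks.s1.type = avro
        remote.sinks.s1.hostname = central.example.com
        remote.sinks.s1.port = 4545
        remote.sinks.s1.channel = c1

        # On the central agent: the matching Avro source
        central.sources.r1.type = avro
        central.sources.r1.bind = 0.0.0.0
        central.sources.r1.port = 4545
        central.sources.r1.channels = c1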

    As you point out, using Kafka as a source allows events to be received by several consumer groups. However, my experience with Flume 1.6.0 is that it is faster to push events from one Flume agent to another on a remote host through Avro RPC than through Kafka.
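    For comparison, a Kafka source sketch, again assuming Flume 1.6-era property names and hypothetical values; the groupId property is what lets several sources share a topic's partitions within one consumer group:

        # Kafka source: agents configured with the same groupId split the
        # topic's partitions among themselves and consume in parallel
        central.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
        central.sources.r1.zookeeperConnect = zk1.example.com:2181
        central.sources.r1.topic = logs
        central.sources.r1.groupId = flume-central
        central.sources.r1.channels = c1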

    So I ended up with the following setup for log data collection:

    [Flume agent on remote collected node] =Avro RPC=> [Flume agent in central cluster] =Kafka=> [multiple consumer groups in central cluster]
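    The central agent of this topology might look like the following sketch, assuming Flume 1.6-era property names and hypothetical broker, ZooKeeper, and topic names; the Avro source feeds a Kafka channel, and the downstream consumer groups read the topic directly:

        # Central agent: Avro RPC in, Kafka topic out
        central.sources.r1.type = avro
        central.sources.r1.bind = 0.0.0.0
        central.sources.r1.port = 4545
        central.sources.r1.channels = k1

        central.channels.k1.type = org.apache.flume.channel.kafka.KafkaChannel
        central.channels.k1.brokerList = kafka1.example.com:9092
        central.channels.k1.zookeeperConnect = zk1.example.com:2181
        central.channels.k1.topic = logs
        # store plain event bodies so non-Flume consumers can read them
        central.channels.k1.parseAsFlumeEvent = false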

    This way, I got better log ingestion and processing throughput, and I could also encrypt and compress the log data in transit between the remote sites and the central cluster. This may change, however, when a future Flume version adds support for the new protocol introduced by Kafka 0.9.0, possibly making Kafka more usable as the front interface of the central cluster for remote data collection nodes (see here).
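    To illustrate the encryption and compression just mentioned, a sketch of the remote agent's Avro sink, assuming Flume 1.6-era property names; the truststore path and password are hypothetical:

        # Remote agent: compress and encrypt events on the wire
        remote.sinks.s1.type = avro
        remote.sinks.s1.hostname = central.example.com
        remote.sinks.s1.port = 4545
        remote.sinks.s1.compression-type = deflate
        remote.sinks.s1.ssl = true
        remote.sinks.s1.truststore = /etc/flume/truststore.jks
        remote.sinks.s1.truststore-password = changeit
        remote.sinks.s1.channel = c1

    The matching Avro source would set compression-type = deflate as well, plus its own ssl and keystore properties.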