Tags: hadoop, hive, flume

How to decide the flume topology approach?


I am setting up Flume but am not sure what topology to go with for our use case.

We basically have two web servers that can generate logs at a rate of 2000 entries per second, each entry around 137 bytes in size.

Currently we use rsyslog (writing to a TCP port), to which a PHP script writes these logs. We are also running a local Flume agent on each web server; these local agents listen on that TCP port and put the data directly into HDFS.

So localhost:tcpport is the Flume source and HDFS is the Flume sink.
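
In Flume configuration terms, the current per-webserver agent looks roughly like the sketch below (the agent and component names, the port, and the HDFS path are placeholders, not our exact values):

```
# One agent per webserver: syslog TCP source -> memory channel -> HDFS sink
agent1.sources  = syslog-src
agent1.channels = mem-ch
agent1.sinks    = hdfs-sink

# rsyslog forwards log lines to this local TCP port
agent1.sources.syslog-src.type = syslogtcp
agent1.sources.syslog-src.host = localhost
agent1.sources.syslog-src.port = 5140
agent1.sources.syslog-src.channels = mem-ch

agent1.channels.mem-ch.type = memory
agent1.channels.mem-ch.capacity = 10000

# Events go straight from the webserver into HDFS
agent1.sinks.hdfs-sink.type = hdfs
agent1.sinks.hdfs-sink.channel = mem-ch
agent1.sinks.hdfs-sink.hdfs.path = hdfs://namenode:8020/logs/web
agent1.sinks.hdfs-sink.hdfs.fileType = DataStream
```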

I am not sure about the above approach and am torn between three alternatives:

Approach 1: Web server, rsyslog and a Flume agent on each machine, plus a Flume collector running on the NameNode in the Hadoop cluster to collect the data and dump it into HDFS.

Approach 2: Web server and rsyslog on the same machine, with a Flume collector (listening on a remote port for events written by rsyslog on the web server) running on the NameNode in the Hadoop cluster to collect the data and dump it into HDFS.

Approach 3: Web server, rsyslog and a Flume agent on the same machine, with all agents writing directly to HDFS.

Also, we are using Hive, so we write directly into partitioned directories, and we want an approach that lets us write into hourly partitions.
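
The kind of sink configuration we have in mind for hourly partitioning is roughly the following, using the HDFS sink's time escape sequences (the warehouse path and partition column names are just examples):

```
# Write files into Hive-style hourly partition directories
agent1.sinks.hdfs-sink.hdfs.path = hdfs://namenode:8020/warehouse/weblogs/dt=%Y-%m-%d/hr=%H
agent1.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true
# Close files often enough that a partition is complete shortly after the hour
agent1.sinks.hdfs-sink.hdfs.rollInterval = 300
```

The new directories would still have to be registered as partitions on the Hive side (for example with ALTER TABLE ... ADD PARTITION).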

Basically I just want to know if people have used Flume for similar purposes, whether it is the right and reliable tool, and whether my approach seems sensible.

I hope that's not too vague. Any help would be appreciated.


Solution

  • The typical suggestion for your problem would be a fan-in or converging-flow agent deployment model (Google "flume fan in" for more details). In this model you would ideally have an agent on each webserver. Each of those agents forwards its events to a few aggregator or collector agents, and the aggregators in turn forward the events to a final destination agent that writes to HDFS (see the configuration sketch below).

    This tiered architecture makes it easier to scale and to handle failover.
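
    A minimal sketch of that fan-in layout, assuming a syslog TCP source on each webserver and an Avro hop to a collector agent that writes to HDFS (hostnames, ports, and paths are illustrative):

```
# --- On each webserver: forward events to the collector tier over Avro ---
web.sources  = syslog-src
web.channels = mem-ch
web.sinks    = avro-out

web.sources.syslog-src.type = syslogtcp
web.sources.syslog-src.host = localhost
web.sources.syslog-src.port = 5140
web.sources.syslog-src.channels = mem-ch

web.channels.mem-ch.type = memory

web.sinks.avro-out.type = avro
web.sinks.avro-out.channel = mem-ch
web.sinks.avro-out.hostname = collector01.example.com
web.sinks.avro-out.port = 4545

# --- On the collector/aggregator node: receive over Avro, write to HDFS ---
coll.sources  = avro-in
coll.channels = file-ch
coll.sinks    = hdfs-out

coll.sources.avro-in.type = avro
coll.sources.avro-in.bind = 0.0.0.0
coll.sources.avro-in.port = 4545
coll.sources.avro-in.channels = file-ch

# A file channel gives durability at the aggregation tier
coll.channels.file-ch.type = file

coll.sinks.hdfs-out.type = hdfs
coll.sinks.hdfs-out.channel = file-ch
coll.sinks.hdfs-out.hdfs.path = hdfs://namenode:8020/warehouse/weblogs/dt=%Y-%m-%d/hr=%H
coll.sinks.hdfs-out.hdfs.useLocalTimeStamp = true
coll.sinks.hdfs-out.hdfs.fileType = DataStream
```

    For failover or load balancing across several collectors, the webserver agents can be given a sink group with one Avro sink per collector.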