Tags: hadoop, real-time

Architecture for syncing logs to Hadoop


I have different environments across a few cloud providers: Windows servers and Linux servers on Rackspace, AWS, etc. There is a firewall between those environments and the internal network.

I need to build a near-real-time pipeline in which all newly generated IIS and Apache logs are synced to an internal big data environment.

I know there are tools like Splunk or Sumo Logic that might help, but we are required to implement this with open source technologies. Because of the firewall, I am assuming I can only pull the logs rather than have the cloud providers push them.

Can anyone share the rule of thumb or a common architecture for syncing tons of logs in near real time (NRT)? I have heard of Apache Flume and Kafka and am wondering whether those are required, or whether it is just a matter of using something like rsync.


Solution

  • You can use rsync to pull the logs, but you won't be able to analyze them the way Spark Streaming or Apache Storm can.

    You can go ahead with one of these two options; a minimal sketch of the first follows this list.

    1. Apache Spark Streaming + Kafka

    2. Apache Storm + Kafka
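
    As a rough sketch of the first option, here is roughly what a Spark Structured Streaming job that consumes log lines from a Kafka topic could look like. The broker address, topic name, and connector package are assumptions for illustration, not part of the original answer.

    ```python
    # Minimal sketch (not the answer's code): Spark Structured Streaming reading
    # log lines from Kafka. Broker, topic, and package coordinates are placeholders;
    # submit with the matching connector, e.g.
    #   spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:<version> job.py
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("log-ingest").getOrCreate()

    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "kafka-internal:9092")  # assumed broker
           .option("subscribe", "weblogs")                            # assumed topic
           .option("startingOffsets", "latest")
           .load())

    # Kafka delivers key/value as bytes; cast the value to a string log line.
    lines = raw.selectExpr("CAST(value AS STRING) AS line", "timestamp")

    # Example analysis: count incoming log lines per 1-minute window.
    counts = lines.groupBy(F.window("timestamp", "1 minute")).count()

    query = (counts.writeStream
             .outputMode("update")
             .format("console")   # replace with an HDFS/Hive sink in a real pipeline
             .start())
    query.awaitTermination()
    ```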

    Have a look at this article on integration approaches for these two options.

    Also have a look at this presentation, which covers an in-depth analysis of Spark Streaming and Apache Storm.

    Performance depends on your use case. Spark Streaming is reported to be up to 40x faster than Storm for some workloads. However, if you add reliability as a key criterion, data should be moved into HDFS before Spark Streaming processes it, which reduces the final throughput.
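
    One way to follow the "HDFS first" advice is to land the raw Kafka stream into HDFS as files and run the analysis on that durable copy. Below is a hedged sketch of such a landing job; the HDFS paths and topic are placeholders, not something prescribed by the answer.

    ```python
    # Sketch: land raw log events from Kafka into HDFS before further processing.
    # Paths and topic name are placeholders for illustration.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("log-landing").getOrCreate()

    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "kafka-internal:9092")
           .option("subscribe", "weblogs")
           .load()
           .selectExpr("CAST(value AS STRING) AS line"))

    # A file sink plus an HDFS checkpoint gives a durable, replayable copy of the
    # stream, at the cost of the extra end-to-end latency mentioned above.
    query = (raw.writeStream
             .format("text")
             .option("path", "hdfs:///data/weblogs/raw")
             .option("checkpointLocation", "hdfs:///checkpoints/weblogs-landing")
             .trigger(processingTime="1 minute")
             .start())
    query.awaitTermination()
    ```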

    Reliability Limitations: Apache Storm

    1. Exactly once processing requires a durable data source.
    2. At least once processing requires a reliable data source.
    3. An unreliable data source can be wrapped to provide additional guarantees.
    4. With durable and reliable sources, Storm will not drop data.
    5. Common pattern: back unreliable data sources with Apache Kafka (a minor latency hit traded for 100% durability); a minimal shipper sketch follows this list.
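
    Point 5 is often implemented with a small shipper on each web server that tails the IIS/Apache access log and publishes every new line to Kafka. The sketch below uses the kafka-python client with acks="all" for durability; the broker, topic, and log path are assumptions.

    ```python
    # Sketch: tail an Apache/IIS access log and publish new lines to Kafka.
    # Uses the kafka-python client; broker, topic, and log path are placeholders.
    # A production shipper would also handle log rotation and offset tracking.
    import time
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers=["kafka-internal:9092"],
        acks="all",    # wait for all in-sync replicas: trades latency for durability
        retries=5,
    )

    LOG_PATH = "/var/log/apache2/access.log"   # assumed log location
    TOPIC = "weblogs"

    with open(LOG_PATH, "r") as f:
        f.seek(0, 2)                    # start at the end of the file, like `tail -f`
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.5)         # no new data yet
                continue
            producer.send(TOPIC, line.rstrip("\n").encode("utf-8"))
    ```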

    Reliability Limitations: Spark Streaming

    1. Fault tolerance and reliability guarantees require an HDFS-backed data source (see the configuration sketch after this list).
    2. Moving data to HDFS prior to stream processing introduces additional latency.
    3. Network data sources (Kafka, etc.) are vulnerable to data loss in the event of a worker node failure.
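
    To mitigate points 1 and 3, Spark Streaming jobs are usually run with their checkpoint (and, for receiver-based DStreams, the write-ahead log) kept on HDFS. A minimal sketch of the relevant settings, assuming placeholder paths:

    ```python
    # Sketch: fault-tolerance settings commonly paired with Spark Streaming + Kafka.
    # Paths are placeholders; the applicable options depend on your Spark version.
    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    conf = (SparkConf()
            .setAppName("log-analytics")
            # Receiver-based DStreams: persist received data to a write-ahead log
            # on HDFS before processing, so a worker failure does not lose it.
            .set("spark.streaming.receiver.writeAheadLog.enable", "true"))

    spark = SparkSession.builder.config(conf=conf).getOrCreate()

    # For Structured Streaming, an HDFS checkpoint location is what lets a restarted
    # job resume from the last committed Kafka offsets instead of dropping data:
    #   .writeStream.option("checkpointLocation", "hdfs:///checkpoints/weblogs")
    ```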