Tags: hadoop, mapreduce, apache-spark, apache-storm, flume

Which is better for log analysis?


I have to use Hadoop-related tools to analyze Gzip-compressed log files that are stored on a production server.

I can't decide how to do that or what to use. Here are some of the methods I thought about using (feel free to recommend something else):

  • Flume
  • Kafka
  • MapReduce

Before I can do anything, I need to get the compressed files from the production server, process them, and then push them into Apache HBase.


Solution

  • Depending on the size of your logs (assuming the computation won't fit on a single machine, i.e. it requires a "big data" product), I think Apache Spark would be the most appropriate choice. Given that you don't know much about the ecosystem, it might be best to start with Databricks Cloud, which gives you a straightforward way to read your logs from HDFS and analyze them with Spark transformations visually, in a notebook. (There's a minimal plain-Spark sketch of the whole pipeline at the end of this answer.)

    You can find a video at the link above.
    There's a free trial, so you can see how that would go and then decide.

    PS: I'm in no way affiliated with Databricks; I just think they have a great product, that's all :)
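
    For reference, here is a minimal sketch of that pipeline in plain Spark (Scala), outside of Databricks: read the gzipped logs straight from HDFS, run a toy aggregation, and write the result into HBase. The path, the table name log_counts, the column family stats, and the assumption that the HTTP status code sits in the ninth space-separated field of each line (Apache combined-log style) are all placeholders for your actual setup:

        import org.apache.hadoop.hbase.HBaseConfiguration
        import org.apache.hadoop.hbase.client.Put
        import org.apache.hadoop.hbase.io.ImmutableBytesWritable
        import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
        import org.apache.hadoop.hbase.util.Bytes
        import org.apache.hadoop.mapreduce.Job
        import org.apache.spark.{SparkConf, SparkContext}
        import org.apache.spark.SparkContext._ // pair-RDD implicits on Spark 1.x

        object GzipLogsToHBase {
          def main(args: Array[String]): Unit = {
            val sc = new SparkContext(new SparkConf().setAppName("gzip-logs-to-hbase"))

            // textFile() decompresses *.gz transparently, but gzip is not
            // splittable, so each file becomes a single partition.
            val lines = sc.textFile("hdfs:///logs/*.gz") // placeholder path

            // Toy analysis: count requests per HTTP status code. Field
            // index 8 is an assumption about the log format; adjust it.
            val counts = lines
              .map(_.split(' '))
              .filter(_.length > 8)
              .map(f => (f(8), 1L))
              .reduceByKey(_ + _)

            // Write the (status, count) pairs into HBase through the standard
            // Hadoop OutputFormat plumbing. HBaseConfiguration.create() picks
            // up hbase-site.xml from the classpath.
            val hbaseConf = HBaseConfiguration.create()
            hbaseConf.set(TableOutputFormat.OUTPUT_TABLE, "log_counts") // hypothetical table
            val job = Job.getInstance(hbaseConf)
            job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])
            job.setOutputKeyClass(classOf[ImmutableBytesWritable])
            job.setOutputValueClass(classOf[Put])

            counts.map { case (status, n) =>
              val put = new Put(Bytes.toBytes(status))
              // add() on HBase 0.9x; use addColumn() on HBase 1.x and later
              put.add(Bytes.toBytes("stats"), Bytes.toBytes("count"), Bytes.toBytes(n))
              (new ImmutableBytesWritable(Bytes.toBytes(status)), put)
            }.saveAsNewAPIHadoopDataset(job.getConfiguration)

            sc.stop()
          }
        }

    One practical note: because gzip isn't splittable, each archive is processed by a single task. If the production logs are a few huge .gz files, splitting them or re-compressing with a splittable codec on the way into HDFS will parallelize much better.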