Tags: database, amazon-web-services, hadoop, data-warehouse, bigdata

I have 2 GB of web server logs produced each day. How do I filter them?


I have a web server that other sites redirect to with some GET parameters. My situation:

  • Currently I have 2 GB of web server logs produced each day.
  • I need to filter the logs for at least half a year (~350 GB of logs).
  • I'm using Amazon infrastructure to store the logs in an S3 bucket. I have two web servers that are writing the logs.

Which technology should I use to query/filter that data? Previously I downloaded the files to a single Ubuntu machine and then grepped them to get the results. I also tested Hadoop on AWS, but I found it difficult to use.

What technology/solution is best in terms of:

  1. Speed of filtering
  2. Ease of learning
  3. Ease of changing the filtering rules

Thank you for your attention to this matter.


Solution

  • You can use AWS CloudWatch Logs; it is built exactly for this need. You can create a log group and log streams, and with a small piece of code on your client side (your web servers) you can automatically push the logs to CloudWatch.

    After sending the log data to CloudWatch, you can search and filter it, and create metrics and dashboards from your log files.
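
    As a rough sketch of what the client-side push and a later search could look like with boto3 (the region, log group/stream names, and the example log line are placeholders, not anything from the question):

        import time
        import boto3

        logs = boto3.client("logs", region_name="eu-west-1")  # region is an assumption

        GROUP = "web-server-logs"   # hypothetical log group name
        STREAM = "web-1"            # hypothetical stream, e.g. one per web server

        # One-time setup; production code would catch ResourceAlreadyExistsException.
        logs.create_log_group(logGroupName=GROUP)
        logs.create_log_stream(logGroupName=GROUP, logStreamName=STREAM)

        # Push a log line; timestamps are milliseconds since the epoch.
        # Real code would batch events and handle retries.
        logs.put_log_events(
            logGroupName=GROUP,
            logStreamName=STREAM,
            logEvents=[{"timestamp": int(time.time() * 1000),
                        "message": "GET /landing?utm_source=partner 200"}],
        )

        # Later: search/filter the stored logs instead of grepping local files.
        resp = logs.filter_log_events(
            logGroupName=GROUP,
            filterPattern='"failed login"',               # CloudWatch Logs filter syntax
            startTime=int((time.time() - 86400) * 1000),  # last 24 hours
            endTime=int(time.time() * 1000),
        )
        for event in resp["events"]:
            print(event["message"])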

    For example, you can count all "failed login" entries in your logs, calculate your web servers' downstream traffic size, or build any other metric.
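
    For instance, a metric filter that counts every "failed login" line could be created like this (the metric name and namespace are made up for the sketch):

        import boto3

        logs = boto3.client("logs")

        # Turn every "failed login" log line into a data point on a custom metric.
        logs.put_metric_filter(
            logGroupName="web-server-logs",           # hypothetical log group
            filterName="failed-login-count",
            filterPattern='"failed login"',
            metricTransformations=[{
                "metricName": "FailedLogins",         # hypothetical metric name
                "metricNamespace": "WebServer/Logs",  # hypothetical namespace
                "metricValue": "1",                   # count 1 per matching line
            }],
        )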

    It's very easy and fast.

    CloudWatch also lets you create alarms, so you receive an alert when something specific happens in your log files.
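
    A sketch of such an alarm on the metric from the previous snippet, notifying an (assumed) SNS topic, might look like this:

        import boto3

        cloudwatch = boto3.client("cloudwatch")

        # Alarm when more than 10 "failed login" lines arrive within 5 minutes.
        cloudwatch.put_metric_alarm(
            AlarmName="too-many-failed-logins",
            Namespace="WebServer/Logs",     # must match the metric filter above
            MetricName="FailedLogins",
            Statistic="Sum",
            Period=300,
            EvaluationPeriods=1,
            Threshold=10,
            ComparisonOperator="GreaterThanThreshold",
            TreatMissingData="notBreaching",
            AlarmActions=["arn:aws:sns:eu-west-1:123456789012:alerts"],  # placeholder topic ARN
        )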

    Finally, you can create a beautiful dashboard from your log metrics.

    Enjoy CloudWatch!

    For more information:

    https://aws.amazon.com/cloudwatch/

    http://docs.aws.amazon.com/AmazonCloudWatch/latest/DeveloperGuide/WhatIsCloudWatchLogs.html