ruby parsing logfiles

Incrementally reading logs


I looked around using numerous search strings but couldn't find anything quite like this:

I'm writing a custom log parser (à la analog or webalizer, except not for web server logs) and I want to skip the hard work for lines that have already been parsed. I have thought about using a history file like webalizer does, but I have no idea how it actually works internally, and my C is pretty poor.

I've considered hashing each line and writing the hashes out, then checking the history file for their presence on the next run, but I think this will perform poorly.
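
For reference, the hashing idea could look roughly like the sketch below. The helper name `each_unseen_line` is made up for illustration, and the set of digests is kept in memory here rather than loaded from a history file. Besides the cost of hashing every line on every run, this approach silently drops lines that legitimately repeat (for example, identical messages without a timestamp).

```ruby
require "digest"
require "set"

# Hypothetical helper: yields only the lines whose digest is not yet in `seen`.
def each_unseen_line(lines, seen)
  lines.each do |line|
    digest = Digest::SHA1.hexdigest(line)
    # Set#add? returns nil when the digest was already present.
    yield line if seen.add?(digest)
  end
end

seen = Set.new # in practice this would be loaded from the history file
first_run  = ["a\n", "b\n"]
second_run = ["a\n", "b\n", "c\n"]

each_unseen_line(first_run, seen)  { |l| puts "parsed: #{l.chomp}" }
# The second pass prints only "parsed: c".
each_unseen_line(second_run, seen) { |l| puts "parsed: #{l.chomp}" }
```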

The only other method I can think of is storing the line number of the last parse and skipping ahead until that number is reached on the next run. I am not sure what happens when the log is rotated.
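
A variation on the line-number idea, sketched under some assumptions: store a byte offset instead of a line number (so the next run can seek rather than skip), plus the file's inode so that a replaced file can be detected. The helper name `each_new_line` and the JSON state file are invented for this example, and it assumes a Unix-like filesystem where inodes are meaningful.

```ruby
require "json"

# Hypothetical helper: yields each line appended to `log_path` since the
# last call, remembering progress in `state_path` (a small JSON file).
def each_new_line(log_path, state_path)
  state = File.exist?(state_path) ? JSON.parse(File.read(state_path)) : {}
  stat  = File.stat(log_path)

  # Start over if the inode changed (the file was replaced by rotation) or
  # the file is now shorter than the saved offset (it was truncated).
  rotated = state["inode"] != stat.ino || stat.size < state["offset"].to_i
  offset  = rotated ? 0 : state["offset"].to_i

  File.open(log_path, "r") do |f|
    f.seek(offset)
    f.each_line do |line|
      break unless line.end_with?("\n") # leave a half-written last line for next time
      yield line
      offset += line.bytesize
    end
  end

  File.write(state_path, JSON.generate("inode" => stat.ino, "offset" => offset))
end

# Usage (paths and `parse` are placeholders):
#   each_new_line("/var/log/app.log", "app.state.json") { |line| parse(line) }
```

This does not recover lines that were written to the old file between the last run and the rotation, and a truncated file that has already grown past the saved offset goes undetected.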

Any other ideas would be appreciated. I will be writing the parser in Ruby, but tips in a similar language will help as well.


Solution

  • The solutions I can think of right now are bound to be brittle.

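    Even if you store the line number and can detect a rotation because the stored number is past the end of the current file, what happens if old lines have merely been trimmed? The stored number would still be in range, and you would start reading (well) after the position you actually reached last time.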

    If, on the other hand, you are sure your log files won't be tampered with and will only be rotated, I see only two ways of doing what you want, and I'm not sure the second is applicable to you.

    Anyway, here goes.

    First solution

    Store the last line you parsed along with a timestamp. On the next run, consider all the rotated log files, sort them by last-modified date, figure out which one you were reading last time, and start reading from there.

    I haven't thought this through; there may be odd corner cases you will need to handle.

    Second solution

    Create a background script that continuously watches the log file. A quick search on Google turned up this gem, but I'm not sure if that's even an option for you. Even then, you might want to combine this solution with the previous one in case your daemon gets interrupted (because that's bound to happen at some point).
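
A minimal sketch of the first solution, under several assumptions that are not in the answer itself: rotated files are uncompressed and match a glob such as `app.log*`, every log line is unique (for example because it carries its own timestamp), and the stored timestamp is the last-modified time of the file that was being read. The helper name `each_unparsed_line` and the JSON state file are invented for illustration.

```ruby
require "json"

# Hypothetical helper. `pattern` is a glob matching the live log and its
# uncompressed rotated siblings, e.g. "app.log*" (an assumed naming scheme).
def each_unparsed_line(pattern, state_path)
  state = File.exist?(state_path) ? JSON.parse(File.read(state_path)) : {}
  last_line  = state["last_line"]
  last_mtime = state["mtime"].to_i

  # Oldest first; files not modified since the checkpoint were fully parsed.
  files = Dir.glob(pattern).sort_by { |f| File.mtime(f) }
  files = files.select { |f| File.mtime(f).to_i >= last_mtime }

  # The file we stopped in is the oldest candidate containing the saved line.
  start = last_line && files.index { |f| File.foreach(f).include?(last_line) }
  files = files.drop(start) if start

  files.each_with_index do |file, i|
    skipping = start && i.zero? # skip up to the saved line, in that file only
    File.foreach(file) do |line|
      if skipping
        skipping = false if line == last_line
        next
      end
      yield line
      state = { "last_line" => line, "mtime" => File.mtime(file).to_i }
    end
  end

  File.write(state_path, JSON.generate(state))
end

# Usage (paths and `parse` are placeholders):
#   each_unparsed_line("/var/log/app.log*", "app.state.json") { |l| parse(l) }
```

If the saved line cannot be found in any candidate file (for instance because old lines were trimmed), this sketch re-reads every candidate from the top, which is one of the corner cases mentioned above. A background script for the second solution could call the same helper in a loop with a short sleep; because progress is checkpointed on every call, an interrupted daemon would pick up where it stopped.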