c multithreading in-memory text-segmentation log-analysis

What is the fastest way to search for patterns across 20-30 GB of log files?


I am performing log analysis that I want to automate so that it runs daily and reports its findings. The analysis runs on standard workstations with 8 cores and up to 32 GB of free RAM. The prototype is based on GNU grep (--mmap), SQLite (on a RAM disk) and Bash (for parameters).

One problem with this is that I need to go through the files multiple times. If I find a pattern match, I search upwards for related entries. This can become recursive, and each pass re-reads gigabytes of data.

Is there a fast way or library in C for memory-backed, segment-wise, multi-threaded file reading and writing?

When I look at the "in-memory" search (moving up and down within a loaded segment, or loading more when necessary), I get the feeling that this is a very general requirement.


Solution

  • Look at Tim Bray's Wide Finder project. It includes a surprisingly simple and versatile solution in Perl by Sean O'Rourke, which mmaps the log into memory and then forks subprocesses to do the searching. Because the whole log file is accessible in each child process, you can move flexibly forward and backward across the initial partitions, and that is what makes it so versatile. You can do the same in C, but I recommend using Perl first to test the concept and rewriting it in C only if you are not satisfied. Personally, I would go from a Perl proof of concept to Erlang + C NIFs, simply out of personal preference. (The Erlang solutions in the Wide Finder project don't use NIFs.)

    Or, if you can afford Splunk, that is the way to go.
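
To make the mmap-and-fork idea above more concrete, here is a minimal sketch in C for a POSIX system. It is not Sean O'Rourke's code, only an illustration of the same approach under some assumptions: the "pattern" is a plain substring rather than a regex, the file fits in the address space, and the names `parallel_count`, `count_matches`, `find_bytes` and `MAX_WORKERS` are made up for this example. Each child owns the lines that start in its slice, but since it sees the whole mapping it can read past the slice boundary to finish a line; the same property would let it scan backwards for related entries.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_WORKERS 64  /* hypothetical cap, keeps the pid table on the stack */

/* First occurrence of pat (plen >= 1 bytes) in [s, s + n), or NULL. */
static const char *find_bytes(const char *s, size_t n,
                              const char *pat, size_t plen)
{
    while (n >= plen) {
        const char *c = memchr(s, pat[0], n - plen + 1);
        if (!c) return NULL;
        if (memcmp(c, pat, plen) == 0) return c;
        n -= (size_t)(c - s) + 1;
        s = c + 1;
    }
    return NULL;
}

/* Count lines containing pat among the lines that START in [lo, hi).
 * A line may extend past hi; the worker simply keeps reading the mapping. */
static long count_matches(const char *base, size_t size,
                          size_t lo, size_t hi, const char *pat)
{
    size_t plen = strlen(pat), pos = lo;
    long hits = 0;

    /* A line begun in the previous slice belongs to that slice: skip it. */
    if (pos > 0 && base[pos - 1] != '\n') {
        const char *nl = memchr(base + pos, '\n', size - pos);
        if (!nl) return 0;
        pos = (size_t)(nl - base) + 1;
    }
    while (pos < hi) {
        const char *nl = memchr(base + pos, '\n', size - pos);
        size_t end = nl ? (size_t)(nl - base) : size;
        if (find_bytes(base + pos, end - pos, pat, plen)) hits++;
        pos = end + 1;
    }
    return hits;
}

/* Map path read-only, fork nworkers children that each search one slice,
 * and sum their counts. Returns the total, or -1 on error. */
long parallel_count(const char *path, const char *pat, int nworkers)
{
    assert(pat[0] != '\0' && nworkers > 0 && nworkers <= MAX_WORKERS);

    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return -1; }
    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); close(fd); return -1; }
    size_t size = (size_t)st.st_size;
    if (size == 0) { close(fd); return 0; }
    const char *base = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                      /* the mapping stays valid after close */
    if (base == MAP_FAILED) { perror("mmap"); return -1; }

    int pfd[2];                     /* one pipe shared by all children */
    if (pipe(pfd) < 0) { perror("pipe"); munmap((void *)base, size); return -1; }

    pid_t pids[MAX_WORKERS];
    int started = 0;
    size_t chunk = size / (size_t)nworkers;
    for (int i = 0; i < nworkers; i++) {
        size_t lo = chunk * (size_t)i;
        size_t hi = (i == nworkers - 1) ? size : lo + chunk;
        pid_t pid = fork();
        if (pid == 0) {             /* child: inherits the whole mapping */
            close(pfd[0]);
            long n = count_matches(base, size, lo, hi, pat);
            ssize_t w = write(pfd[1], &n, sizeof n);  /* small, atomic write */
            _exit(w == (ssize_t)sizeof n ? 0 : 1);
        }
        if (pid < 0) { perror("fork"); break; }
        pids[started++] = pid;
    }
    close(pfd[1]);

    long total = 0, n;
    int got = 0;
    while (read(pfd[0], &n, sizeof n) == (ssize_t)sizeof n) { total += n; got++; }
    close(pfd[0]);
    for (int i = 0; i < started; i++) waitpid(pids[i], NULL, 0);
    munmap((void *)base, size);
    return got == nworkers ? total : -1;
}
```

A caller would invoke something like `parallel_count("app.log", "ERROR", 8)` from its own `main`, with one worker per core. Processes are used here instead of threads because that mirrors the forking approach described above; since the mapping is read-only, the children share the same page-cache pages, so the log is not copied per worker.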