I have a bunch of Apache log files that I need to parse and extract information from. My script works fine for a single file, but I'm wondering about the best approach to handling multiple files.
Should I:
- loop through all files and create a temporary file holding all their contents
- run my logic on that concatenated file
Or
- loop through every file
- run my logic file by file
- try to merge the results of every file (roughly the sketch below)
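For concreteness, the second approach might look something like this, with a hypothetical parse_file() standing in for my single-file logic (the names here are illustrative, not my actual code):

from collections import defaultdict
from glob import glob

def parse_file(path):
    # Hypothetical stand-in for the existing single-file script:
    # assumes the machine name is the first whitespace-separated field.
    results = defaultdict(list)
    with open(path) as f:
        for line in f:
            fields = line.split(None, 1)
            if fields:
                results[fields[0]].append(line.rstrip('\n'))
    return results

# Merge per-file results: entry lists for the same machine key
# are simply concatenated across files.
merged = defaultdict(list)
for path in sorted(glob('/some/dir/with/logs/*.log')):
    for machine, entries in parse_file(path).items():
        merged[machine].extend(entries)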
File-wise, I'm looking at logs covering about a year, with roughly 2 million entries per day, reported for a large number of machines. My single-file script generates an object with "entries" for every machine, so I'm wondering:
Question:
Should I generate one joint temporary file, or run file by file, generate per-file objects, and merge x files holding entries for the same y machines?
You could use glob and the fileinput module to effectively loop over all of them and treat them as one "large file":
import fileinput
from glob import glob

# Collect every log file in the directory
log_files = glob('/some/dir/with/logs/*.log')

# fileinput chains the files, so the loop reads them as one stream
for line in fileinput.input(log_files):
    pass  # do something with each line
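That way you get the behaviour of the concatenated temporary file without ever writing one, so there is nothing to merge afterwards. If you still want to know which file a line came from (say, to report progress over a year of logs), fileinput tracks that for you. A minimal sketch; the machine-name parsing is just an assumed placeholder for your real logic:

import fileinput
from collections import defaultdict
from glob import glob

log_files = sorted(glob('/some/dir/with/logs/*.log'))
entries_per_machine = defaultdict(int)

for line in fileinput.input(log_files):
    if fileinput.isfirstline():
        # fileinput.filename() names the file currently being read
        print('processing', fileinput.filename())
    # Placeholder: assumes the machine name is the first field;
    # substitute your actual extraction here.
    fields = line.split(None, 1)
    if fields:
        entries_per_machine[fields[0]] += 1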