I have a bunch of Apache log files that I need to parse and extract information from. My script works fine for a single file, but I'm wondering about the best approach to handling multiple files.
Should I:
- loop through all files and create a temporary file holding all their contents
- run my logic on that concatenated file
Or
- loop through every file
- run my logic file by file
- try to merge the results of every file (roughly the sketch below)
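For concreteness, the second approach might look something like this, with a hypothetical parse_file() standing in for my single-file logic (the names here are illustrative, not my actual code):

from collections import defaultdict
from glob import glob

def parse_file(path):
    # Hypothetical stand-in for the existing single-file script:
    # assumes the machine name is the first whitespace-separated field.
    results = defaultdict(list)
    with open(path) as f:
        for line in f:
            fields = line.split(None, 1)
            if fields:
                results[fields[0]].append(line.rstrip('\n'))
    return results

# Merge per-file results: entry lists for the same machine key
# are simply concatenated across files.
merged = defaultdict(list)
for path in sorted(glob('/some/dir/with/logs/*.log')):
    for machine, entries in parse_file(path).items():
        merged[machine].extend(entries)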
File-wise, I'm looking at logs covering about a year, with roughly 2 million entries per day, reported for a large number of machines. My single-file script generates an object with "entries" for every machine, so I'm wondering:
Question:
Should I generate one joint temporary file, or run file by file, generate per-file objects, and merge x files holding entries for the same y machines?
You could use glob and the fileinput module to effectively loop over all of them and treat them as one "large file":
import fileinput
from glob import glob

# Collect every log file in the directory
log_files = glob('/some/dir/with/logs/*.log')

# fileinput chains the files, so the loop reads them as one stream
for line in fileinput.input(log_files):
    pass  # do something with each line
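That way you get the behaviour of the concatenated temporary file without ever writing one, so there is nothing to merge afterwards. If you still want to know which file a line came from (say, to report progress over a year of logs), fileinput tracks that for you. A minimal sketch; the machine-name parsing is just an assumed placeholder for your real logic:

import fileinput
from collections import defaultdict
from glob import glob

log_files = sorted(glob('/some/dir/with/logs/*.log'))
entries_per_machine = defaultdict(int)

for line in fileinput.input(log_files):
    if fileinput.isfirstline():
        # fileinput.filename() names the file currently being read
        print('processing', fileinput.filename())
    # Placeholder: assumes the machine name is the first field;
    # substitute your actual extraction here.
    fields = line.split(None, 1)
    if fields:
        entries_per_machine[fields[0]] += 1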